Estimation and selection for high-order Markov chains with Bayesian mixture transition distribution models

06/25/2019 ∙ by Matthew Heiner, et al. ∙ 0

We develop two models for Bayesian estimation and selection in high-order, discrete-state Markov chains. Both are based on the mixture transition distribution, which constructs a transition probability tensor with additive mixing of probabilities from first-order transition matrices. We demonstrate two uses for the proposed models: parsimonious approximation of high-order dynamics by mixing lower-order transition models, and order/lag selection through over-specification and shrinkage via priors for sparse probability vectors. The priors further shrink all models to an identifiable and interpretable parameterization, useful for data analysis. We discuss properties of the models and demonstrate their utility with simulation studies. We further apply the methodology to a data analysis from the high-order Markov chain literature and to a time series of pink salmon abundance in a creek in Alaska, U.S.A.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Consider modeling a time series of nominal or ordinal values collected at equally spaced, discrete times

. A popular approach for capturing serial correlation is to assume Markovian dynamics: that the conditional probability distribution of

depends only on the recent past. Time homogeneity, or time invariance of the transition probabilities, is also typically assumed. These simplifying assumptions, nearly essential for inference in small or moderate sample size scenarios, are often appropriate even if the time series is not truly Markovian. Another common assumption is to condition only on the single most recent lag. However, restricting a model to first-order dynamics, or even selecting the incorrect lag, can miss important features in the data. In this article, we propose Bayesian models to address two distinct objectives: estimation for the relevant time-delay coordinates, the Markovian order and important lags; and parsimonious modeling of high-order chains.

Assuming time homogeneity, a full, unrestricted first-order model requires estimation of discrete distributions, each with free parameters. A Markov chain of order requires estimation of such distributions, limiting consideration to low orders for most time series. Typically, order (or lag) is selected by maximizing a (possibly penalized) likelihood (Katz, 1981; Raftery, 1985; Prado and West, 2010), performing trans-dimensional MCMC (Green, 1995; Insua et al., 2012)

, using Bayes factors

(Fan and Tsai, 1999; Bacallado, 2011; Zucchini and MacDonald, 2009), predictive criteria, or goodness-of-fit tests (Bartlett, 1951; Besag and Mondal, 2013). Each of these approaches requires either fitting multiple models or complex estimation methods. Our approach is to build lag inference into a single model.

Several approaches have been proposed to address exponential growth in the parameter space for higher-order transitions. Raftery (1985) introduced the mixture transition distribution (MTD), a general-purpose, parsimonious model for Markov chains. The MTD model was extended in Raftery and Tavaré (1994) and developed over the subsequent decade. Berchtold and Raftery (2002)

provide a review. In the original MTD model, lags contribute to the transition probabilities by mixing over a single transition matrix. Only one new parameter is added for each additional lag. Despite its simplicity, the MTD framework can provide flexibility to capture nonstandard features, such as “outliers, bursts, and flat stretches,” as demonstrated by

Le et al. (1996) for a continuous-state version of the MTD.

Contemporary with the MTD model, generalized linear models for multinomial outcomes were applied to categorical time series (Liang and Zeger, 1986; Zeger and Liang, 1986; Fahrmeir and Kaufmann, 1987). These models can accommodate varying degrees of complexity by controlling the order of interactions among the linear predictors (lags), up to and including a full model with parameters. These models can also account for exogenous sources of non-stationarity through covariates. However, estimation and interpretability become problematic in these models when many lags are considered.

Tree-based methods provide an alternative parsimonious approach. Variable-length Markov chains (VLMC, Ron et al., 1994; Bühlmann et al., 1999) reduce the parameter space by clustering the transition distributions via recursive pruning. Sparse Markov chains (SMC, Jääskinen et al., 2014) partition the -dimensional lag space without hierarchical constraints, resulting in greater flexibility. They also feature a prior structure which encourages low orders. Although efficient, these models lack posterior uncertainty quantification, and inferences for order and lag importance are not readily available.

More recently, Sarkar and Dunson (2016) proposed a Bayesian nonparametric model for high-order Markov chains. They model the transition distributions through tensor factorization and further encourage parsimony by clustering the components of a core mixing distribution with a Dirichlet process prior (Ferguson, 1973). By allowing variable dimensions along different modes of the core mixing distribution, the model further admits inferences for lag importance. This model enjoys a fully Bayesian, albeit complicated, implementation and performs well against the methods described above in forecasting when there are up to four states and ten lags.

Our modeling strategy is to build on the simplicity and interpretability of the MTD model. One popular extension of the MTD, referred to by Berchtold and Raftery (2002) as the MTDg model, utilizes a separate transition matrix for each lag. While this more flexible model grows linearly with each additional lag, it is not identifiable (Lèbre and Bourguignon, 2008). Recently, Tank et al. (2017) used a reparameterization to establish a unique and identifiable characterization of the MTDg model in the context of multiple time series. Using a penalized likelihood and proximal gradient optimization, they softly enforce the identifiability conditions and simultaneously select relevant series to infer Granger causality. We propose a Bayesian estimation approach to the MTDg model which utilizes priors introduced by Heiner et al. (2019) to promote shrinkage toward the identifiability conditions of Tank et al. (2017), and to simultaneously select relevant lags. These priors were previously demonstrated to effectively select a single active lag using the original MTD model. We then propose an extension which allows for higher-order interaction between lags, as well as inference for the Markovian order (i.e., the number of active lags) up to a pre-specified maximum.

The remainder of this paper is organized as follows. In Section 2

, we review the MTD model and develop our proposed extensions. We outline our approach for Bayesian inference using structured priors to aid with the models’ intended uses in Section

3. In Section 4, we test the models using two simulation scenarios that reflect our two objectives, demonstrating improved predictive performance over the original MTD. Section 5 illustrates the models through two analyses, first on a data set which appears in the preceding literature, and second on annual time series of pink salmon abundance in Alaska, U.S.A. Finally, we conclude with a summary in Section 6. Technical details are provided in the appendices.

2 Models

In a full -order, time-homogeneous Markov chain, the collection of all possible transition probabilities , for , , , can be arranged in a -order tensor . If we condition on the first observations of the time series, the joint sampling distribution for the remaining sequence is given by , defining the conditional likelihood that we employ hereafter. We begin by specifying the original MTD model in Section 2.1 and motivate its extensions. In Section 2.2, we discuss the MTDg extension and associated identifiability results. We then introduce a Bayesian formulation for the MTDg which uses priors for sparse probability vectors in Section 2.3. Finally, we propose an extension to include higher-order transitions in Section 2.4.

2.1 Original mixture transition distribution

The mixture transition distribution model constructs the transition probability tensor

as linear combinations of probabilities from a single column-stochastic matrix

and adds just one parameter for each additional lag

, similar to autoregressive models. The transition probabilities in a model of order

are given as


where , and . Although this model incorporates information beyond the first lag, it is restrictive in that it cannot capture nonlinear (non-additive) dynamics in more than one dimension of the lag space.

Form (1) suggests that lags which play a prominent role in the transition probability for will have relatively large and lags which are not important to the transition will have values near 0. Hence, inferences for potentially yield information about important lags for the Markov process. It is apparent from (1) that is sufficient for conditional independence of and . If the columns of are unique, then is also a necessary condition for conditional independence. Inferences on have been employed to understand lag importance informally (Raftery and Tavaré, 1994), although the standard method for assessing order has been to compare BIC values (Berchtold and Raftery, 2002). Heiner et al. (2019) use a single model, relying on inferences on for insight into lag importance. We adopt the same approach here.

2.2 MTDg, identifiability, and lag selection

The MTDg model modifies (1) by using a distinct column-stochastic matrix for each lag . While this increases flexibility and allows for different transition types associated with each lag, the model lacks identifiability. Tank et al. (2017) demonstrate this by introducing an intercept probability vector, , extending to include , and reparameterizing the transition probabilities through the products and , resulting in the MTDg formulation


One can then freely transfer probability mass by subtracting some vector
from each column of and adding it to while preserving all values in . Selecting to be the minimum value in the th row of for each , and following the transferral procedure just described for , results in a maximally reduced parameterization in the sense that the highest probability mass possible has been transferred to the intercept while maintaining non-negativity of all elements in . Let denote the resulting parameters after the reduction procedure so that and , for , , and . Tank et al. (2017) show that this maximal reduction yields a unique representation for every MTDg. Furthermore, one can view the reduced model in the original parameterization using with , and , for , and invariant to choice of ; probability vector ; and column-stochastic matrices . Then is interpretable as a marginal contribution of the th lag to the transition distribution. Thus, with careful construction, we may use to make inferences about lag importance even though the MTDg is overparameterized. Furthermore, the intercept allows us to infer the possible lack of serial dependence in a direct way.

To operationalize this reduction in an estimation procedure, Tank et al. (2017) show that solutions to a penalized likelihood based on (2), in which the penalty increases with respect to the absolute value of the entries in the (excluding the intercept), meet the maximal-reduction criterion. Their proposed soft penalty functions equivalently regularize for . Because is sufficient and necessary (as long as the columns of are distinct) for conditional independence of the current state from lag , the penalized estimate simultaneously detects lag relevance (or Granger causality in the case where indexes multiple time series).

2.3 Bayesian MTDg with priors for sparse probability vectors

We now present a Bayesian modeling approach to the MTDg, which admits full characterization of uncertainty. Rather than addressing the constraint that each column of sum to , we work with the original and parameters and employ carefully chosen prior distributions that shrink toward the identifiable and interpretable and . In this and the following section, we employ two prior distributions for probability vectors, introduced by Heiner et al. (2019), which go beyond the standard Dirichlet prior by enforcing sparsity in the presence of data, as well as conditional stochastic ordering. Both priors are continuous, bypassing problems that arise from the sum-to-one constraint when using priors with point masses.

The first is the sparse Dirichlet mixture (SDM) prior. This one-parameter extension of the Dirichlet distribution is a fixed-weight mixture of Dirichlet distributions, with each component featuring a boost of equivalent sample size in one of the categories. If is a probability vector of length , the SDM density is given as


where denotes the Dirichlet density with shape parameter vector , with , where is a vector of 0s with a 1 in the th position, and with denoting the gamma function. For small sample sizes and relatively large , the SDM can be described as a winner-takes-all prior in that it shifts most mass toward the corner of the supporting simplex for the component with largest .

The second prior is the stick-breaking mixture (SBM) prior. This prior builds the probability vector through an extension of the stick-breaking construction that defines the generalized Dirichlet distribution (Connor and Mosimann, 1969). In particular,



independently drawn from a mixture of three beta distributions,

, where . This mixture structure encourages sparsity by setting large, in which case the first component corresponds to small probabilities in . The third component allows for the rest of the unbroken stick (i.e., ) to be used for , while the second mixture component allows for flexibility in modeling . The and parameters can be fixed at the same values across , or can be set to mimic the Dirichlet distribution as with the generalized Dirichlet distribution, resulting in a three-parameter extension of the Dirichlet distribution. To accommodate our proposed extensions of the MTD model, we adapt the SBM prior to allow the mixture weights , , and to vary with .

Beyond the properties noted, one important advantage of the SDM and SBM priors is their conjugacy and resulting computational tractability. If the hyperparameters of the priors are fixed, as is usually the case with the original Dirichlet priors, incorporating them into a hierarchical model involving multinomial counts (latent or observed) requires minimal effort since posterior Gibbs sampling proceeds with conditional distributions that can be directly sampled, allowing us to swap priors without structural changes to the updates in Appendix


If a modeler believes that exactly one lag influences the transition distribution, the SDM prior can be used in the single- MTD model, as demonstrated in Heiner et al. (2019). However, this is not recommended for the MTDg model. If is not sufficiently high, the SDM prior may distribute posterior mass to a non-unique and non-interpretable configuration of . We instead use a SBM prior for that favors the unique reduction and more appropriately allows for dependence on multiple lags. Here, the stick-breaking construction of the SBM provides intuition, as is drawn first, and the rest of is broken sequentially from what remains in the unbroken stick. To avoid penalizing the intercept, we set for only and use either

(the uniform distribution) or beta shape parameters that favor large values of

. The remaining beta mixtures use and to regularize . Setting allows for small values of the corresponding , effectively skipping the th lag. Setting promotes consumption of the remaining mass before reaching . If and across and is relatively high, the sequential SBM construction can further regularize via stochastic ordering, consistent with the common assumption that recent lags should carry more influence.

The model is completed with prior distributions for . The traditional choice for transition matrices is to use independent Dirichlet distributions for each column, which we adopt here. Heiner et al. (2019) found it advantageous to use independent SBM priors for each column of (in the standard MTD model) in cases of nearly deterministic dynamics. However, the MTDg model spreads estimation across multiple matrices, relying more heavily on the non-symmetric SBM prior. This can potentially introduce undesired artifacts to the estimated transition probabilities.

The full hierarchical model specification for the MTDg and details for posterior inference are discussed in Section 3 and Appendix B.

2.4 Mixtures of higher-order MTD components

The MTD and MTDg models offer parsimonious and interpretable representations for Markov chains with dependence extending beyond the most recent lag. However, these models are strictly additive in the sense that any dynamics of order higher than one (i.e., more than one active lag) are approximated with linear combinations of first-order transitions. In their survey of generalizations for the MTD, Berchtold and Raftery (2002) suggest, but do not pursue, the possibility of mixing over higher-order transition tensors. We build a Bayesian framework for such an extension to include higher-order “interactions,” and we refer to the resulting model as the mixture of mixture transition distributions (MMTD) model.

To define the MMTD model, let be a positive integer representing the highest-order transition tensor over which we will mix. Thus we have , a length- probability vector; , a transition matrix; , a transition tensor; and so forth up to , a transition tensor, such that for all . Next, introduce a mixing probability vector across orders, . The MMTD model for transition probabilities is then given by


where is a probability vector of length for . This mixture of mixtures is equivalent to using a single (albeit long) probability vector to mix over all possible arrangements of lags and base transition tensors . However, this parameterization is more informative about important orders (via inference for ) in addition to lags (via inference for ). If , we recover the original MTD model. The fully-parameterized transitions associated with allow unrestricted dynamics in dimensions of the lag space. As a discrete mixture of probability distributions, this model produces a valid probability tensor.

The model in (2.4) is clearly over-parameterized, and consequently , , and are not fully identified. Defining a reduction procedure similar to that of Tank et al. (2017) for the MMTD is more nuanced. One complication arises because the result is dependent on the order of reduction. For example, one may first transfer probability mass to the intercept from all higher-order transition tensors, followed by transfer to the first-order transitions by defining elements of the vector as the minima over indexes in , for , which correspond to a unique value of the lagged state for the lag associated with the current (allowing for such matrices). However, transferring first to the associated with lag 1 and then to the associated with lag 2 does not yield the same result if we reverse the order. Alternatively, one could define a reduction process in terms of projecting first onto an intercept, then projecting what remains onto the space spanned by the first-order level of the MMTD, and so forth. Thus, the intercept has the first opportunity to describe the base probabilities, then the first-order level of the model has the next opportunity to capture first-order dynamics, and each additional level fills in what lower-order levels cannot adequately model. Absent a formal procedure, we note that in estimation, this process would be implemented with regularization, for which the sequential SBM prior is well-suited. We therefore propose using a SBM prior for , similar to the one used in Section 2.3 for the MTDg model.

Even under regularization of model order, it is possible for a high-order to mimic a lower-order tensor through repetition of transition probabilities across values of a certain lag. Rather than build complicated constraints into the model, we note that this issue can be detectable through inferences for . Specifically, a modeler may plot posterior estimates of the distributions in and check for repeating patterns, especially patterns that coincide with a slice of the tensor (e.g., all columns equal in in a model would indicate that the intercept alone is adequate). We strongly recommend following this practice before interpreting model inferences for and .

We envision two primary uses for the MMTD model. The first is to uncover low-order structure from data whose practical lag dependence horizon is truly smaller than our over-specified and the order is less than or equal to our selected , in which case the true model is contained within the mixture framework. For example, we might postulate that a time series has second-order dependence, but we are unsure which two lags are important. Assuming a maximal lag horizon of 10, we could fit the MMTD model with and . Because there is only one at each level, we could use the SDM prior on each with a large value of to select the appropriate lag configuration and discourage mixing lower-order transitions. If the dynamics are truly second-order, we would anticipate that would carry substantial posterior weight, and that inferences for would identify the influential lags. In this model-selection scenario, the SDM (on ) and SBM (on ) priors play an important role in selection and interpretation.

If the true order of dependence in the time series is greater than , our second intended use for the MMTD model is analogous to that of the MTD and MTDg, wherein we parsimoniously approximate higher-order dependence by mixing lower-order transition distributions. Adding the higher-order tensors could be thought of as including interaction-like terms in the mixture. In this scenario, one may still use the SDM prior for each , but with a lower value of to encourage more mixing (note that yields a Dirichlet prior). The SBM prior on further allows mixing across orders, so that different levels of the model may mix across non-overlapping sets of lags.

As with the MTDg, we recommend using independent Dirichlet priors for each of the probability distributions in for . If and are small, is large, and the transition probability tensor is known to be sparse, it may potentially be advantageous to replace these Dirichlet priors with independent SBM priors. However, we strongly urge caution, as information from the data is spread thin and the SBM prior is not symmetric. Heiner et al. (2019) discuss a strategy for promoting more Dirichlet-like behavior in the SBM prior.

Our proposed model formulation requires estimation of free parameters, free parameters, and free parameters in . The fastest-growing term in the parameter count increases no faster than a polynomial in of degree divided by , while the transition distributions grow exponentially. Table 1 reports the total number of parameters to estimate for different combinations of , , and . Typically, is fixed and known, and a modeler must select and considering parsimony, estimability for a given sample size, and computational cost. If is much smaller than , the MMTD substantially reduces the parameter space from the original full-order Markov chain. The parameter space is effectively further reduced by the sparsity-inducing priors on and .

The hierarchical model specification for the MMTD and posterior inference details are discussed in Section 3 and Appendix C.

Total Unrestricted
2 5 2 2 13 7 22 32
2 5 4 4 26 31 61 32
2 10 2 2 53 7 62 1,024
2 10 4 4 381 31 416 1,024
5 5 2 2 13 124 139 12,500
5 5 4 4 26 3,124 3,154 12,500
5 10 2 2 53 124 179 3.91
5 10 4 4 381 3,124 3,509 3.91
7 5 2 2 13 342 357 100,842
7 5 4 4 26 16,806 16,836 100,842
7 10 2 2 53 342 397 1.69
7 10 4 4 381 16,806 17,191 1.69
Table 1: Free parameter count for MMTD model under different combinations of state-space size , largest possible lag , and largest mixing order . The total number of parameters is the sum of the free , , and parameters. The unrestricted total is the number of parameters required to estimate an unrestricted transition probability tensor of order .

3 Bayesian inference and computation

We now address implementation of the MTDg and MMTD models. To obtain full posterior inference, we utilize a Gibbs sampler which alternates between collapsed and full conditional distributions made tractable by augmentation with latent configuration variables (Insua et al., 2012). As noted earlier, all inferences condition on the first observations in the time series .

3.1 MTDg model

The sampling distribution for the MTDg model is


We first break the mixture in (6) by introducing latent indicators such that for independently across . Adding the priors yields the full hierarchical model. For ; ; ; and , we have


where each is a length- vector of positive shape parameters; and are length- vectors containing probabilities such that ; and and are length- vectors containing positive shape parameters. We always set to avoid penalizing , and also recommend setting .

This structure admits closed-form full conditional distributions for all of , , and . Specifically, the update for is a conjugate SBM-multinomial update using aggregated counts of . Each can be updated with a discrete distribution involving and elements of as they appear in the likelihood. Given , we can aggregate the transition counts into sufficient statistics and , a set of matrices. For example, if , , and , we would increment . The full conditional distribution for the intercept is then an updated Dirichlet distribution with providing the multinomial counts. Likewise, the update for involves a conjugate Dirichlet-multinomial update with its corresponding count vector , for each and .

As is common with mixture models, the full joint posterior distribution is multimodal and the Gibbs sampler described above is prone to poor mixing. To improve mixing, we modify the Gibbs sampler just described in two ways. First, we integrate all parameters out of the full joint posterior. This affects only the full conditional distributions for the configuration variables , which are drawn from a (different) discrete distribution. The second modification is an occasional (every 10 iterations) hybrid Metropolis step that jointly proposes and from the prior in order to encourage exploration. Ordinarily, the prior is inefficient as a proposal distribution. While the sparse configurations proposed by the SBM prior help mitigate this issue, we still advocate running multiple long MCMC chains to ensure adequate mixing. Full details for the modified Gibbs sampler are provided in Appendix B.

3.2 MMTD model

Our implementation for the MMTD model is analogous to the MTDg model, with a few notable extensions. As before, the sampling distribution for the time series is given as a product of transition probabilities in (2.4) across . We again break the mixture in (2.4) by introducing latent configuration variables such that , for , independently for each observation time. Then conditional on (and for ), further introduce such that , for , independently for each observation time. The hierarchical formulation for this model is given in generative order as follows. For ; ; ; , ; and , we have


where is a length- vector of positive shape parameters (which could potentially be separately specified for each distribution in each ); and are length- vectors containing probabilities such that ; and are length- vectors containing positive shape parameters; is a length- vector of positive shape parameters; and is the SDM sparsity parameter. We always set to avoid penalizing , and recommend setting as well. Note that all quantities in (3.2) without explicit dependence are considered independent a priori.

As with the MTDg, posterior simulation can be accomplished entirely through closed-form Gibbs sampling. To simplify computation, we uniquely map all and pairs onto a single variable

whose prior probability under the model is equal to the product of the corresponding

and . Full conditional distributions for , each , and each probability vector in are exactly analogous to multinomial-SBM, multinomial-SDM, and multinomial-Dirichlet conjugate updates, respectively, where , , and observed data transitions supply the respective multinomial counts. Full conditional updates for and (equivalently ) require calculation and sampling from a discrete distribution. Full details are given in Appendix C.

We again improve mixing in the sampler by integrating from the joint posterior, sampling the collapsed conditional distributions for , each , and . These are supported by the tractable marginal distributions reported in Appendix A. Additionally, to encourage occasional jumps between modes of the posterior, we include a hybrid independence-Metropolis step which jointly proposes , each , and from their joint prior every 10 iterations of the MCMC algorithm.

The augmented Gibbs sampler becomes computationally demanding as and increase because updates for the latent configuration variables involve calculation of probabilities for each time point . Random-walk Metropolis samplers for , , and utilizing the mixture likelihood based on (2.4) may provide an alternate strategy if

is reasonably small. The logit-normal distribution

(Atchison and Shen, 1980), or multivariate Gaussian random walks on the logit scale, facilitate properly constrained proposals for probability vectors.

4 Simulation study

To demonstrate the effectiveness of the MMTD model for both objectives and to compare transition probability estimation performance with existing methods, we report two simulation studies. Both simulation scenarios feature time series generated from true Markov chains of differing order and lag configuration. In Simulation 1, the true generating model is a third-order chain with three states () in which transition probabilities depend on lags 1, 3, and 4. In Simulation 2, the true generating model is a fifth-order binary chain () for which each of the first five lags contributes to transition probabilities. In both models, each distribution in the transition tensor was drawn from a uniform distribution on the simplex (i.e., symmetric Dirichlet distributions with all shape parameters equal to 1). Each chain was randomly initialized and run for 1,000 steps of burn-in. The first 1,000 samples thereafter were reserved for training data and the next 1,000 for validation.

To evaluate estimation of transition probabilities, each model was fit using the prescribed number of training samples, and point estimates of the transition distributions were compared to the true transition distributions for each of the 1,000 validation points. Specifically, for validation time point , each model produced a vector to estimate each , for . In Bayesian models, the point estimate is the Monte Carlo-computed posterior mean of . In non-Bayesian models, is computed from the optimized model fit. For each validation time point, we computed the loss given by . The reported loss metric for model comparison is , that is, 100 times the mean loss across the validation points.

We fit the MTD, MTDg, and MMTD models to each training set with various settings. Implementation for the MTD is similar to the MTDg and is described in Heiner et al. (2019). Let and denote the respective model fits with user-specified maximum lag horizon , and let denote a model fit with user-specified maximum lag horizon and maximum order . All transition distributions in all and in all three models utilize symmetric, unit-information Dirichlet priors (i.e., whose shape parameters all equal so that they sum to unity).

We use two prior settings for the MTD model. The first employs a Dirichlet prior for with all shape parameters equal to . The second setting uses a SBM prior for with , , , and selected to mimic the Dirichlet prior with shape parameters equal to and sparsity correction on (Heiner et al., 2019). This prior encourages a moderate level of sparsity as well as decreasing prior probability for higher lags.

The MTDg model uses a SBM prior for with for and thereafter; for and thereafter; ; , yielding a uniform prior for , and remaining elements of and selected to mimic a Dirichlet prior with shape parameters equal to and sparsity correction on . This prior avoids penalizing , encourages a moderate level of sparsity in the remaining lags, and steeper decrease in prior probability for higher lags than used for the MTD.

We use two prior settings in the MMTD models. Both follow (3.2) with
, but the first setting uses for all , resulting in the symmetric, unit-information Dirichlet prior. The second setting uses to encourage selection of a single-lag configuration within level . In both prior settings, we employ the SBM prior for with for and thereafter; for and thereafter; ; and for all second-component beta distributions, yielding a uniform prior for . This prior avoids penalizing , allows for sparsity in the remaining lags, and maintains soft ordering that favors lower levels of the model. Because is typically kept to small values, it is important that not be large and that not be too small. Otherwise, the prior can inappropriately allocate substantial mass toward large values of . We recommend checking for this condition as part of prior sensitivity analysis.

To obtain results in Sections 4.1, 4.2, and 5, each model was initialized with random draws from Dirichlet distributions for and each . Random initialization in these models calls for long burn-in periods, on the order of tens to hundreds of thousands of iterations. In our analyses, 200,000 burn-in iterations were followed by another 400,000 iterations. Reported posterior quantities were calculated using a thinned sample retaining every 200th iteration. These conservative settings produced (unless otherwise noted) stable chains suitable for inference. In some cases, parallel chains sampled from MMTD models settled in neighboring modes which had minor impact on inferences and performance. With all models, posterior sampling was conducted using the Julia scientific computing language (Bezanson et al., 2017).

In addition to our proposed models, we fit the multinomial generalized linear models with logistic link functions to each training set using the VGAM package in R (Yee et al., 2010). To distinguish different settings, we denote model fit as with maximum lag horizon and highest interaction order among the linear predictors . We also fit the variable length Markov chain models, denoted VLMC, using the VLMC package in R (Maechler, 2015) and employing default model settings.

4.1 Simulation 1 results

All models were fit to the time series from Simulation 1 for two sample sizes, and . Here, we assume that the modeler is considering up to a horizon of six lags, which we use where possible to promote equitable comparisons. Results of the mean loss across the 1,000 validation points are given in Table 2. In addition to transition probability estimation, we are interested in inferences for Markovian order and important lags afforded by the MTD, MTDg, and MMTD models. With exception of the MTD and MTDg, we see improved estimation with the larger sample size across all models.

Model Loss
LogitMC(6, 1) 18.70
LogitMC(6, 2) 37.42
LogitMC(3, 1), Lags 1, 3, 4 only 17.14
LogitMC(3, 2), Lags 1, 3, 4 only 13.70
LogitMC(3, 3), Lags 1, 3, 4 only 12.64
VLMC 19.12
MTD(6), Dir() 17.29
MTD(6), SBM() 17.27
MTDg(6) 17.30
MMTD(6, 2), Dir() 14.92
MMTD(6, 2) 14.78
MMTD(6, 3), Dir() 14.65
MMTD(6, 3) 13.72
MMTD(6, 4), Dir() 14.93
MMTD(6, 4) 14.24
MMTD(6, 5), Dir() 15.13
MMTD(6, 5) 14.40
Model Loss
LogitMC(6, 1) 16.77
LogitMC(6, 2) 18.84
LogitMC(3, 1), Lags 1, 3, 4 only 16.39
LogitMC(3, 2), Lags 1, 3, 4 only 10.40
LogitMC(3, 3), Lags 1, 3, 4 only 7.64
VLMC 15.26
MTD(6), Dir() 17.27
MTD(6), SBM() 17.21
MTDg(6) 17.05
MMTD(6, 2), Dir() 13.77
MMTD(6, 2) 13.83
MMTD(6, 3), Dir() 7.55
MMTD(6, 3) 7.44
MMTD(6, 4), Dir() 7.58
MMTD(6, 4) 7.48
MMTD(6, 5), Dir() 7.56
MMTD(6, 5) 7.47
Table 2: Simulation 1 ( states for a third-order chain with active lags 1, 3, and 4). Results for transition probability estimation under various models and model settings using two sample sizes, and . The reported loss is 100 times the mean loss, computed across 1,000 validation time points. Within each sample size group, the lowest mean loss is highlighted with bold font.

4.1.1 Sample size 200

In the case, the multinomial logistic models produce the best and worst results. Fitting all second-order interactions for up to six lags is cumbersome in this model, resulting in poor estimates. Fitting the full-order model to the correct lags only produces accurate estimates. However, this would require preliminary results from an iterative process which may or may not select the correct model and does not account for model uncertainty. We emphasize here that our proposed models do not require a model selection process if the modeler specifies the maximum lag horizon and maximum order , as order and lag inferences are built-in.

The variable length Markov chain model offers no improvement, possibly because the dynamics governing Simulation 1 skip lag 2. VLMC branches utilizing more distant lags must pass through and include lag 2. This results in a missed opportunity for greater parsimony (Jääskinen et al., 2014).

The MTD and MTDg models offer little help in this scenario because Simulation 1 is third-order with non-additive interactions. Posterior densities for (not shown) reveal that is favored under the Dirichlet prior (in the MTD) and dominates with the SBM prior (in both the MTD and MTDg). The latter effectively produces a first order Markov chain dependent on the third lag.

Several MMTD models were fit with increasing maximum order ranging from 2 to 5. The second-order model provides a substantial improvement over mixing first-order transitions and fits nearly as well as the correctly specified third-order model. As expected, estimation performance stops improving when exceeds the true order of three. Using SDM priors on parameters improves estimation, but more so when the correct model is contained in the specified model.

In the model, posterior inference supports second-order dynamics with the lag 3,4 combination receiving most posterior weight. Adding the SDM priors on results in stronger support for the same conclusion. In the model, posterior inferences support second or third-order dynamics with lags 3 and 4 receiving most posterior weight. Adding the SDM priors led to weakly favoring order 3 (selecting lags 1, 3 and 4) in one model run. The model most often supports second-order dynamics (lags 3 and 4) under both prior scenarios. Results from the model are similar to the model.

It appears that the signal associated with lag 1 is relatively weak when . The SBM prior on shrinks inferences toward second-order, but not decidedly away from third-order dynamics. Overall, the MMTD consistently produces the most faithful estimates of transition probabilities from a single model without requiring iterative model selection.

4.1.2 Sample size 500

In the case, even the multinomial logistic models fit directly to the correct lags only fail to outperform the MMTD with . With a larger sample size, the VLMC model is more competitive, but the MTD and MTDg remain insufficiently flexible to capture the structure. The MTD model mixes over lags 2, 4, and 5 with a Dirichlet prior on , and primarily over lags 4 and 5 with the SDM prior. The MTDg concentrates some mass on lags 3 and 4.

In this large-sample scenario, MMTD performance improves substantially when the specified mixture model contains the true model structure (i.e., ), although the second-order MMTD again improves over the first-order additive MTD and MTDg. In the MMTD models with , posterior mass concentrates on the correct order and lag configuration. Furthermore, we see little drop in performance when the maximal order is over-specified. The SDM priors on appear not to significantly improve estimation, and inferences are qualitatively similar. The inferences from the model are similar to those of the models fit to observations. Among the models considered in this simulation scenario, the MMTD consistently produces the most faithful estimates of transition probabilities.

4.2 Simulation 2 results

All models were fit to the time series from Simulation 2 for three sample sizes: , and . Here, we assume that the modeler is considering up to a horizon of seven lags, which we use where possible to promote equitable comparisons. Results of the mean loss across the 1,000 validation points are given in Table 3. Again, we examine order and lag inferences from the MTD, MTDg, and MMTD models in addition to estimation performance.

Model Loss
LogitMC(7, 1) 24.30
LogitMC(7, 2) 26.15
LogitMC(7, 3) n/a
LogitMC(5, 1) 24.49
LogitMC(5, 2) 20.86
LogitMC(5, 3) n/a
LogitMC(5, 4) n/a
VLMC 20.52
MTD(7) 24.71
MTD(7) 24.50
MTDg(7) 24.01
MMTD(7, 4) 23.68
MMTD(7, 4) 23.21
MMTD(7, 7) 23.68
MMTD(7, 7) 23.33
Model Loss
LogitMC(7, 1) 20.03
LogitMC(7, 2) 16.26
LogitMC(7, 3) 18.70
LogitMC(5, 1) 20.25
LogitMC(5, 2) 16.24
LogitMC(5, 3) 11.26
LogitMC(5, 4) n/a
VLMC 15.45
MTD(7) 22.47
MTD(7) 23.31
MTDg(7) 23.93
MMTD(7, 4) 15.26
MMTD(7, 4) 15.38
MMTD(7, 7) 14.13
MMTD(7, 7) 13.93
Model Loss
LogitMC(7, 1) 18.53
LogitMC(7, 2) 14.66
LogitMC(7, 3) 13.67
LogitMC(5, 1) 18.90
LogitMC(5, 2) 15.35
LogitMC(5, 3) 8.29
LogitMC(5, 4) 7.79
VLMC 12.13
MTD(7) 19.82
MTD(7) 19.59
MTDg(7) 19.68
MMTD(7, 4) 12.15
MMTD(7, 4) 14.70
MMTD(7, 7) 7.59
MMTD(7, 7) 7.38
Table 3: Simulation 2 ( states for a fifth-order chain with five active lags). Results for transition probability estimation under various models and model settings using three sample sizes: , and . The reported loss is 100 times the mean loss, computed across 1,000 validation time points. Within each sample size group, the lowest mean loss is highlighted with bold font.

4.2.1 Sample size 100

In the case, high order interactions are not estimable in the multinomial logistic model. The VLMC model performs best in this scenario, presumably because Simulation 2 features no gap in relevant lags.

Because the simulation uses lags 1 through 5, the MTD(7), MTDg(7), and MMTD(7, 4) models are under-specified and must rely on a lower-order sub-model and/or mixing across lags to approximate the fifth-order dynamics. The MTD models mix primarily over lags 2 and 4, while the MTDg model concentrates on lag 2. The MMTD(7, 4) models mix primarily over low orders, with slight preference for lag 2. The over-specified MMTD(7, 7) models do not outperform the models, and produce qualitatively equivalent inferences for and . It is apparent that the small sample size is insufficient to capture the fifth-order structure.

4.2.2 Sample size 200

With , the time series is long enough to include third-order interactions in the multinomial logistic model, which performs well. The VLMC model is again competitive with the MMTD and generally outperforms the logistic models.

As before, the MTD and MTDg models are unable to leverage increased sample size to the extent that the other models can. The MTD models mix primarily over lags 1 and 5, while the MTDg model mixes primarily on lag 1 (due to the ordered prior on ).

The higher-order interactions allowed by the MMTD become advantageous with , making this model competitive. The MMTD(7, 4) models concentrate posterior mass on order four and lags 1, 2, 3, and 5. Posterior mass in the MMTD(7, 7) model is split between order four and five, again demonstrating the shrinking effect of the SBM prior on . Different runs of the MCMC chain favor lag configurations and . SDM priors on had a minor concentrating effect on the posterior densities.

We note that the over-specified MMTD(7, 7) with a less strictly ordered prior on (such as the SDM) can outperform the correctly specified logistic model in this scenario. However, inferences from such models can be suspect, as they do not shrink toward the “reduced” and identifiable parameterization. While we favor reliably interpretable inferences, one may consider modifying or replacing the SBM prior on to improve predictive performance.

4.2.3 Sample size 500

The fifth-order binary chain in Simulation 2 has 32 total (univariate) transition distributions which are easily estimated with 500 samples. Therefore, the multinomial logistic models with high-order interactions approach the performance of the over-specified MMTD models. The VLMC is also competitive. Again, the MTD(7) and MTDg(7) models lag noticeably behind in estimation performance, although both attempt to mix over multiple lags.

The MMTD(7, 4) with a Dirichlet prior on again concentrates on order four and lags 1, 2, 3, and 5. The same model with the SDM prior selects lags 3, 4, 5, and 6, and performs noticeably worse (loss of 14.70). A second MCMC run places most weight on orders three and four, and lags 1, 3, 4, and 5, resulting in average loss of 12.65. This highlights multimodality of the posterior and the need for replicate MCMC runs.

The MMTD(7, 7) decisively identifies the correct order and lag structure resulting in the best estimation performance. We conclude that the MMTD consistently produces the most faithful estimates of transition probabilities from a single model without requiring iterative model selection.

5 Data illustrations

We now apply the MTDg and MMTD models to two data analyses. The first data example was studied with the original MTD and in the subsequent literature. The second is an analysis of pink salmon population dynamics in Alaska, U.S.A. during the twentieth century. We illustrate the use of inferences on order and lag importance available from the models.

5.1 Seizure data

Berchtold and Raftery (2002)

demonstrate the MTD model using a binarized time series adapted from

MacDonald and Zucchini (1997), which reports the occurrence of at least one epileptic seizure for a patient on each of 204 consecutive days. Berchtold and Raftery (2002) fit several Markov chain and MTD models, using the Bayesian information criterion (BIC) to ultimately select a MTD with eight lags. They report that has the greatest magnitude. Note that the MTD model used in Berchtold and Raftery (2002) allows negative values in , which requires a complex set of constraints for estimation. The seizure time series was revisited and fit using the methods in Sarkar and Dunson (2016), who report a model of maximal order 8, with lag 8 having the highest posterior inclusion probability. In contrast with Berchtold and Raftery (2002), they find lag 1 to be the second most important. They report the posterior mode for the number of important lags to be three.

In light of these two analyses, we fit the MTDg and MMTD to the seizure data with and , each with prior settings identical to those used in the simulation studies. Trace plots (not shown) indicate that the marginal posterior distributions over and each are multimodal, suggesting that more than one combination of lags could model the dynamics with similar accuracy. We note also that the assumption of time-homogeneity is questionable, as no seizures were reported in the last 29 days.

The MTDg model concentrates most posterior weight on lag 8, followed distantly by lags 4 and 9. The transition matrix for lag 8 suggests that the status eight days prior is most often replicated in the present (seizure or no seizure). This transition pattern is repeated for lags 4 and 9, producing a compounding effect for repeated seizures in the model, which effect we should emphasize is additive only. That this model clearly selects lag 8 demonstrates the utility of the SBM prior for the MTDg. The prior simultaneously shrinks parameters toward the identifiable model and maintains a conditional stochastic ordering on the lags while maintaining flexibility to select distant lags when this is supported by the data.

The MMTD(10, 4) model with Dirichlet priors on lag configurations mixes primarily over orders one and two. The standard model (with SDM priors on lag configuration weights) shifts more posterior weight to higher orders, with and edging one another in separate MCMC runs. Without clear selection of order and lags, we discourage over-interpretation of the transition probabilities in . However, it is clear that the estimated probabilities favor persisting in previous states. For example, the posterior means for and are 0.78 and 0.73 respectively (posterior medians are 0.92 and 0.85). That is, no occurrence of seizure in recent days yields a high probability for no seizure on the current day, and repeated occurrence of seizures on multiple past days yields a high probability of seizure on the current day.

Figure 1:

Posterior mean (with 95% credible interval) inclusion index for each lag in the seizure analysis, under the MMTD(10,4) model with the Dirichlet priors (left) and SDM priors (right) on lag configuration weights


We can more comprehensively assess lag importance by computing a lag inclusion index as the sum of all products across and for which lag appears in the lag configuration . We compute this for each lag at each MCMC sample. Inference for is included as lag 0 for reference. Due to the shrinking SBM prior for , a high inclusion index for lag 0 should not be interpreted as a lack of Markovian dependence (unless it is near 1 with high confidence). However, a low inclusion index for lag 0 relative to other lags can indicate strong Markovian dependence. We summarize the inclusion index for the models fit to the seizure data in Figure 1, with bars reporting the posterior mean and whiskers reaching to the ends of 95% posterior credible intervals. Note the large uncertainty for this inclusion index for all lags except lag 8 in the model with SDM priors on . We further note that the posterior inclusion pattern across lags (associated with the SDM priors for lag configuration weights) resembles a plot with similar interpretation in Figure 6 (e) of Sarkar and Dunson (2016). The most notable exception is that lag 1 does not feature prominently in our inferences. All analyses, including our three, agree that lag 8 is the most important in determining the transition probability.

5.2 Pink salmon data

Figure 2: Time-series plot (top) and bivariate lag plots (bottom) for the natural logarithm of pink salmon abundance from 1934 to 1963. In the lag plots, denotes abundance at time and horizontal/vertical lines separate quantile-based bins used to assign into discrete states .

We next investigate a time series of annual pink salmon abundance (escapement) at a creek in Alaska, U.S.A. from 1934 to 1963 (Alaska Fisheries Science Center, 2018). Population dynamics for pink salmon provide a testing opportunity for our model because pink salmon have a strict two-year life cycle (Heard, 1991). Thus, we expect even lags to have the most influence in predicting the current year’s population. A time-series plot of the natural logarithm of abundance is given in Figure 2 together with bivariate lag scatter plots. In this scenario, we might expect non-stationarity with long-term trends. It appears from the time series that the even-year population began to struggle in the late 1940s. Repeated interventions throughout the 1950s culminated in a population transfer in 1964 that bisects the complete time series and restricts us to the first segment (Bradshaw and Heintz, 2003). Nevertheless, the lag scatter plots suggest that we should be able to detect lag dependence structure, even with as few as 30 observations. After discretizing the data into sets of quantile-based bins using all 30 years, we fit the proposed models with the same prior settings used for the simulation studies. Because discretization is based on quantiles, results are invariant to monotonic transformations such as the natural logarithm.

The MTDg(5) model fit to the pink salmon time series clearly identifies lag 2 as the most influential (posterior means for , are 0.17, 0.05, 0.67, 0.01, 0.07, and 0.03, respectively). The estimated also closely agrees with the lag-2 scatter plot in Figure 2. In contrast, the MMTD(5,2) model shifts considerable posterior weight toward order two despite the shrinkage prior on order. Uncertainty, stemming from noisy dynamics and a small sample size, again results in a multimodal posterior, as seen in the density plots for in Figure 3. This uncertainty is also apparent in posterior inferences for the lag inclusion index. Credible intervals on the lag inclusion index are wide enough to warrant their omission from Figure 4. In these plots, we see essential agreement between the two prior settings for , with lag 2 being most prominent. Lags 4 and 5 also appear to contribute in some of the favored configurations.

Figure 3: Marginal posterior density plots for in the pink salmon analysis using a SBM prior on order and SDM priors for lag configuration weights.

It is important to examine the MMTD estimate of to verify that the model is not attempting to fit first-order dynamics with a second-order chain. If this were the case, estimates of transition probabilities in would repeat across the second lag index (in this case most likely representing lag 4 and/or 5). The posterior mean estimate of , shown in Figure 5, appears not to have this problem, as consecutive sub-matrices appear not to repeat. This is consistent under both priors. Hence, lag 2 may not be the only important lag.

Figure 4: Posterior mean inclusion index for each lag in the pink salmon analysis under the MMTD(5,2) model with Dirichlet priors (left) and SDM priors (right) on lag configuration weights.
Figure 5: Posterior mean point estimate of the matricized from the MMTD(5,2) pink salmon analysis with SDM priors on . Rows (along the -axis) represent states to which the transition occurs, and columns (along the -axis) represent the states occupied by the first two selected lags, with the state corresponding to the most recent lag changing index first.

6 Summary

We have explored two extensions of the original mixture transition distribution model for high-order Markov chains. The first is a Bayesian approach to a recent extension capable of identifying one or more important lags. The second captures higher-order interactions and can potentially yield useful inferences for order and lag importance. To accomplish the latter, the mixture of mixture transition distributions uses an over-specified model and sparsity-inducing priors to shrink back to an interpretable and informative structure. This is accomplished in a single model without necessitating iterative selection. Furthermore, our MCMC algorithm allows us to evaluate uncertainty about the model structure and transition probabilities. We demonstrated that this model can outperform some of the standard methods in transition probability estimation, and shown its practical utility in data analysis.

The over-specified MMTD model can offer insights into order and active lags provided the modeler approaches analysis attentively. In cases of large sample size or near-determinism, the true structure will immediately manifest in inferences for the mixture weight parameters. More often, lag importance should be aggregated and extracted in post-processing as we demonstrated in Section 5.1 with the lag inclusion index. If multiple lag patterns are prominent in the mixture model or different order components have non-overlapping lag patterns selected in each, the actual order of the time series may be higher than the highest selected model order. We also recommend checking the mixture transition tensors for redundancy, a sign of lower-order dependence. Absent a clearly identified lag structure through inferences on the mixture weights, we claim that this too can be informative.

Both the MTDg and MMTD models can approximate high-order dynamics by exploiting constructive additivity among lower-order transition probabilities. However, when this does not apply, a full jump to the next order in the MMTD is required. For example, in the salmon data analysis with five lags, the first-order mixture has five components associated with a transition matrix, whereas the second-order mixture has ten components associated with a transition tensor. A parsimonious compromise might rely on some factorization of the second-order tensor, as in Sarkar and Dunson (2016). We do not pursue this here, but rather choose to emphasize the interpretable structure of the proposed MMTD model (2.4) which showcases a model-averaging flavor with added flexibility.

Appendix A Marginal distributions

We report the marginal distributions of observations associated with the Dirichlet and SBM priors for probability vectors given originally in Heiner et al. (2019). These distributions can be useful for computing Bayes factors in addition to facilitating the MCMC algorithms described in Appendices B and C.

Consider a length-

sequence of independent random variables

with common distribution . Given , the probability of the sequence is where the sufficient statistics in count the occurrences of each category.

If follows a Dirichlet distribution with shape parameter vector , then the marginal (prior predictive) distribution of is given by


where denotes the multivariate beta function.

Now suppose follows the SBM distribution with parameters , , , , and . Let

with , and . Then the marginal distribution of has probability mass function