Bayesian analysis of mixture autoregressive models covering the complete parameter space

06/19/2020 ∙ by Davide Ravagli, et al. ∙ The University of Manchester 0

Mixture autoregressive (MAR) models provide a flexible way to model time series with predictive distributions which depend on the recent history of the process and are able to accommodate asymmetry and multimodality. Bayesian inference for such models offers the additional advantage of incorporating the uncertainty in the estimated models into the predictions. We introduce a new way of sampling from the posterior distribution of the parameters of MAR models which allows for covering the complete parameter space of the models, unlike previous approaches. We also propose a relabelling algorithm to deal a posteriori with label switching. We apply our new method to simulated and real datasets, discuss the accuracy and performance of our new method, as well as its advantages over previous studies. The idea of density forecasting using MCMC output is also introduced.



There are no comments yet.


page 16

page 17

page 22

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Mixture autoregressive (MAR) models (Wong and Li, 2000) provide a flexible way to model time series with predictive distributions which depend on the recent history of the process. Not only do the predictive distributions change over time, they are also different for different horizons for predictions made at a fixed time point. As a consequence, they inherently accommodate asymmetry, multimodality and heteroskedasticity. For this reason, mixture autoregressive models have been considered a valuable alternative to other models for financial time series, such as the SETAR model (Tong, 1990), the Gaussian transition mixture distribution model (Le et al., 1996), or the widely used class of GARCH models (Nelson, 1991).

Wong and Li (2000) considered estimation of MAR models based on the EM algorithm (Dempster et al., 1977). That method is particularly well suited for mixture-type models and works well. On the other hand, a Bayesian approach can offer the advantage of incorporating the uncertainty in the estimated models into the predictions.

Sampietro (2006) presented the first Bayesian analysis of MAR models. In his work, reversible jump MCMC (Green, 1995) is used to select the autoregressive orders of the components in the mixture, and models with different number of components are compared using methods by Chib (1995) and Chib and Jeliazkov (2001), which exploit the marginal likelihood identity. In addition, he derives analytically posterior distributions for all parameters in the selected model.

The Bayesian updates of the autoregressive parameters are problematic, because the parameters need to be kept in the stationarity region, which is very complex, and so cannot really be updated independently of each other. In the case of autoregressive (AR) models, it is routine to use parametrisation in terms of partial autocorrelations (Jones, 1987), which are subject only to the restriction to be in the interval . Sampietro (2006) adapted this neatly to MAR models by parameterising the autoregressive parameters of each component of the MAR model with the partial autocorrelations of an AR model with those parameters.

A major drawback of Sampietro’s sampling algorithm for the autoregressive parameters, is that it restricts the parameters of each component to be in the stationarity region of an autoregressive model. While this guarantees that the MAR model is stationary, it excludes from consideration considerable part of the stationarity region of the MAR model (Wong and Li, 2000, p. 98; Boshnakov, 2011

). Depending on the mixture probabilities, the excluded part can be substantial. For example, most examples in

Wong and Li (2000, p. 98) cannot be handled by Sampietro’s approach, see also the examples in Section 4.

Hossain (2012) developed a full analysis (model selection and sampling), which reduced the constraints of Sampietro’s analysis. Using Metropolis-Hastings algorithm and a truncated Gaussian proposal distribution for the moves, he directly simulated the autoregressive parameters from their posterior distribution. This method still imposes a constraint on the autoregressive parameters through the choice of boundaries for the truncated Gaussian proposal. While the truncation is used to keep the parameters in the stationarity region, the choice of boundaries is arbitrary and can leave out a substantial part of the stationarity region of the model. In addition, his reversible jump move for the autoregressive order seems conservative, as it uses functions which always prefer jumps towards low autoregressive orders (this will be seen in Section 3.5).

A common problem associated with mixtures is label switching (see for instance Celeux, 2000), which derives from symmetry in the likelihood function. If no prior information is available to distinguish components in the mixture, then the posterior distribution will also be symmetric. It is essential that label switching is detected and handled properly in order to obtain meaningful results. A common way to deal with this, also used by Sampietro (2006) and Hossain (2012), is to impose identifiability constraints. However, it is well known that such constraints may lead to bias and other problems. In the case of MAR models, Hossain (2012) showed that these constraints may affect convergence to the posterior distribution.

We develop a new procedure which resolves the above problems. We propose an alternative Metropolis-Hastings move to sample directly from the posterior distribution of the autoregressive components. Our method covers the complete parameter space. We also propose a way of selecting optimal autoregressive orders using reversible jump MCMC for choosing the autoregressive order of each component in the mixture, which is less conservative than that of Hossain. We propose the use of a relabelling algorithm to deal a posteriori with label switching.

We apply the new method to both simulated and real datasets, and discuss the accuracy and performance of our algorithm, as well as its advantages over previous studies. Finally, we briefly introduce the idea of density forecasting using MCMC output.

The structure of the paper is as follows. In Section 2 we introduce the mixture autoregressive model and the notation we need. In Section 3 we give detailed description of our method for Bayesian analysis of MAR models, including model selection, full description of the sampling algorithm, and the relabelling algorithm to deal with label switching. Section 4 shows results from application of our method to simulated and real dataset. Section 5 introduces the idea of density forecast using MCMC output.

2 The mixture autoregressive model

A process is said to follow a Mixture autoregressive (MAR) process if its distribution function, conditional on past information, can be written as



  • is the sigma field generated by the process up to (and including) . Informally, denotes all the available information at time , the most immediate past.

  • is the total number of autoregressive components.

  • ,

    , are the mixing weights or proportions, specifying a discrete probability distribution. So,


    . We will denote the vector of mixing weights by


  • is the distribution function (CDF) of a standardised distribution with location parameter zero and scale parameter one. The corresponding density function will be denoted by .

  • is the vector of autoregressive parameters for the component, with being the shift. Here, is the autoregressive order of component and we define to be the largest order among the components. A useful convention is to set , for .

  • is the scale parameter for the component. We denote by the vector of scale parameters. Furthermore, we define the precision, , of the component by .

  • If the process starts at , then Equation (1) holds for .

We will refer to the model defined by Equation (1) as model. The following notation will also be needed. Let

The error term associated with the th component at time is defined by


A useful alternative expression for is the following mean corrected form:

Comparing the two representations we get

If , we also have


A nice feature of this model is that the one-step predictive distributions are given directly by the specification of the model with Equation (1). The -steps ahead predictive distributions of at time  can be obtained by simulation (Wong and Li, 2000) or, in the case of Gaussian and -stable components, analytically (Boshnakov, 2009).

We focus here on mixtures of Gaussian components. In this case, using the standard notations and

for the CDF and PDF of the standard Normal distribution, we have

and , for . The model in Equation (1) can hence be written as


or, alternatively, in terms of the conditional pdf


Conditional mean and variance of



The correlation structure of a MAR process with maximum order is similar to that of an process. At lag we have:

2.1 Stability of the MAR model

Stationarity conditions for MAR time series have some similarity to those for autoregressions with some notable differences. Below we give the results we need, see Boshnakov (2011) and the references therein for further details.

A matrix is stable if and only if all of its eigenvalues have moduli smaller than one (equivalently, lie inside the unit circle). Consider the companion matrices

We say that the MAR model is stable if and only if the matrix.

is stable ( is the Kronecker product). If a MAR model is stable, then it can be used as a model for stationary time series. The stability condition is sometimes called stationarity condition.

If , the MAR model reduces to an AR model and the above condition states that the model is stable if and only if is stable, which is equivalent to the same requirement for . For , it is still true that if all matrices , , are stable, then is also stable. However the inverse is no longer true, i.e. may be stable even if one or more of the matrices are not stable.

What the above means is that the parameters of some of the components of a MAR model may not correspond to stationary AR models. It is convenient to refer to such components as “non-stationary”.

Partial autocorrelations are often used as parameters of autoregressive models because they transform the stationarity region of the autoregressive parameters to a hyper-cube with sides . The above discussion shows that the partial autocorrelations corresponding to the components of a MAR model cannot be used as parameters if coverage of the entire stationary region of the MAR model is desired.

3 Bayesian analysis of mixture autoregressive models

3.1 Likelihood function and missing data formulation

Given data , the likelihood function for the MAR model in the case of Gaussian mixture components takes the form in (see Equation (5))

The likelihood function is not very tractable and a standard approach is to recur to a missing data formulation (Dempster et al., 1977).


be a latent allocation random variable, where

is a g-dimensional vector with entry equal to if comes from the component of the mixture, and otherwise. We assume that the

s are discrete random variables, independently drawn from the discrete distribution:


This setup, widely exploited in the literature (see, for instance Dempster et al., 1977; Diebolt and Robert, 1994) allows to rewrite the likelihood function in a much more tractable way as follows:


In practice, the s are not available. We adopt a Bayesian approach to deal with this. We set suitable prior distributions on the latent variables and the parameters of the model and develop a methodology for obtaining posterior distributions of the parameters and dealing with other issues arising in the model building process.

3.2 Priors setup and choice of hyperparameters

The setup of prior distributions is based on Sampietro (2006) and Hossain (2012). In the absence of any relevant prior information it is natural to assume a priori that each data point is equally likely to be generated from any component, i.e.

. This is a discrete uniform distribution, which is a particular case of the multinomial distribution. The conjugate prior of the latter is the Dirichlet distribution. We therefore set the prior for the mixing weigths vector,

, to


The prior distribution on the component means is a normal distribution with common fixed hyperparameters

for the mean and for the precision, i.e.


For the component precisions, , a hierarchical approach is adopted, as suggested in Richardson and Green (1997). Here, for a generic

component the prior is a Gamma distribution with hyperparameters

(fixed) and , which itself follows a gamma distribution with fixed hyperparameters and . We have therefore


The main difference between our approach and that of Sampietro (2006) and Hossain (2012) is in the treatment of the autoregressive parameters.

Sampietro (2006) exploits the one-to-one relationship between partial autocorrelations and autoregressive parameters for autoregressive models descirbed in Jones (1987). Namely, he parameterises each MAR component with partial autocorrelations, draws samples from the posterior distribution of the partial autocorrelations via Gibbs-type moves and converts them to autoregressive parameters using the functional relationship between partial autocorrelations and autoregressive parameters. Of course, the term “partial autocorrellations” doesn’t refer to the actual partial autocorrellations of the MAR process, they are simply transformed parameters. The advantage of this procedure is that the stability region for the partial autocorrelation parameters is just a hyper-cube with marginals in the interval , while for the AR parameters it is a body whose boundary involves non-linear relationships between the parameters.

A drawback of the partial autocorrelations approach in the MAR case is that it covers only a subset of the stability region of the model. Depending on the other parameters, the loss may be substantial.

Hossain (2012) overcomes the above drawbacks by simulating the AR parameters directly. He uses Random Walk Metropolis, while applying a constraint to the proposal distribution (a truncated Normal). The truncation is chosen as a compromise that ensures that most of the stability region is covered, while keeping a reasonable acceptance rate. Although effective with ”well behaved” data, there are scenarios, especially concerning financial examples, in which the loss of information due to a pre-set truncation becomes significant, as will be shown later on. In this paper, we choose Random Walk Metropolis for simulation from the posterior distribution of autoregressive parameters, while exploiting the stability condition to avoid restraining the parameter space a priori.

With the above considerations, for the autoregressive parameters we choose a multivariate uniform distribution with range in the stability region of the model, and independence between parameters is assumed. Hence, for a generic prior distribution is such that:

where denotes the indicator function assuming value if the condition is satisfied and otherwise. We prefer this to a Normal prior as it better allows to explore the parameter space, and detect the presence of multimodality.

Choice of hyperparameters.

Here we discuss the settings for the hyperparameters , , , , and . We have already discussed that the hyperparameters for the Dirichlet prior distribution on the mixing weights (all equal to ). Also, is a hyperparameter but it is a random variable with distribution which will be fully specified once and are.

Following Richardson and Green (1997), let be the length of the interval variation of the dataset. Also fix the two hyperparameters and . The remaining hyperparameters are set as follows:

3.3 Posterior distributions and acceptance probability for RWM

Following Sampietro (2006) and Hossain (2012), posterior distributions for all but the autoregressive parameters are as follows:


where, for ,

All these parameters are updated via a Gibbs-type move. Similarly,

s are simulated from a multinomial distribution with associated posterior probabilities.

To update autoregressive parameters, let , , be the set of current states of the autoregressive parameters, i.e. a set of observations from the posterior distribution of . We can simulate from a proposal distribution, denoted by , with , where

is the identity matrix of size


Here , is a tuning parameter, chosen in such way that the acceptance rate of RWM is optimal () for component . We allow to change between components, but to be constant within the same component. Notice the difference between our proposal and the two-step approach by Sampietro (2006), or the truncated Normal proposal chosen by Hossain (2012). The probability of accepting a move to the proposed is


where , due to the symmetry in the Normal proposal. Therefore, the acceptance probability will only depend on the likelihood ratio of the new set of parameters over the current set of parameters, i.e.



This means that the likelihood ratio for the component is independent of current values of parameters for the remaining components. This enables to calculate likelihood ratios separately for each component.

The procedure described builds a candidate model with updated mixing weights, shift, scale and autoregressive parameters. However, because stability of such model does not only depend on the autoregressive parameters, we must ensure that the stability condition of Section 2.1 is satisfied. If this is not the case, the candidate model and all its parameters are rejected, and the current state of the chain is set to be the same as at the previous iteration.

3.4 Dealing with label switching

Once the samples have been drawn, label switching is dealt with using a -means clustering algorithm proposed by Celeux (2000). It is natural to use the identifiability constraint but it is well known that it is problematic. Examples are given in the discussion to the paper by Richardson and Green (1997). It was shown in fact by Hossain (2012) that applying an identifiability constraint such as may in some cases affect convergence of the chain. With our approach instead, we do not interfere with the chain during the simulation, and hence convergence is not affected.

Our algorithm works by first choosing the first simulated values of the output after convergence. The value shall be chosen small enough for labels switch to not have occurred yet, and large enough to be able to calculate reliable initial values of cluster centres and their respective variances.

Let be a subset of model parameters of size , and the size of the converged sample. For any centre coordinate , we calculate the mean and variance, based on the first simulated values, respectively as:

We set this to be the “true” permutation of the components, i.e. we now have an initial center with variances , . The remaining permutations can be obtained by simply permuting these centres.

From these initial estimates, the iteration () of the procedure consists of two steps:

  • the parameter vector is assigned to the cluster such that the normalised squared distance


    is minimised, where is the centre coordinate and

    its standard deviation, at the latest update


  • Centre coordinates and their variances are respectively updated as follows:




    for .

For the mixture autoregressive case, it is not always clear which subset of the parameters should be used. In fact, group separation might seem clearer in the mixing weights at times, as well as in the scale or shift parameters. Therefore this method requires graphical assistance, i.e. checking the raw output looking for clear group separation. However, it is advisable not to use the autoregressive parameters, especially when the orders are different.

Once the selected subset has been relabelled, labes for the remaining parameters can be switched accordingly.

3.5 Reversible Jump MCMC for choosing autoregressive orders

For this step, we use Reversible Jump MCMC (Green, 1995). At each iteration, one component is randomly chosen from the model. Let be the current autoregressive order of this component, and set to be the largest possible value may assume. For the selected component, we propose to increase or decrease its autoregressive order by with probabilities

where , and such that and . Notice that (or equivalently ) may be any function defined in the interval satisfying such condition. For instance, Hossain (2012) introduced two parametric functions for this step. However, in absence of relevant prior information, we choose in our analysis, while presenting the method in the general case.

Finally, it is necessary to point out that in both scenarios we have a 1-1 mapping between current and proposed model, so that the resulting Jacobian is always equal to .

Given a proposed move, we proceed as follows:

  • If the proposal is to move from to , we simply drop , and calculate the acceptance probability by multiplying the likelihood ratio and the proposal ratio, i.e.


    where is the density of the parameter dropped out of the model, according to its proposal distribution.

    If the candidate model is not stable, then it is automatically rejected, i.e. .

  • If the proposed move is from to , we proceed by simulating the additional parameter from a suitable distribution. In absence of relevant prior information, the choice is to simulate a value from a uniform distribution centred in and with appropriate range, so that values both close and far apart from , both positive and negative, are taken into consideration.

    These considerations lead to draw

    The acceptance probability is in this case


    where is the inverse ofthe density.

    Once again, if the candidate model is not stable, and the current model is retained.

3.6 Choosing the number of components

To select the appropriate number of autoregressive components in the mixture, we apply the methods proposed by Chib (1995) and Chib and Jeliazkov (2001), respectively, for use of output from Gibbs and Metropolis-Hastings sampling. Both make use of the marginal likelihood identity.

From Bayes theorem, we know that


where is the prior distribution on , and is the marginal likelihood function, defined as


with being the parameter vector of the model.

For any values , , number of components and observed data , we can use the marginal likelihood identity to decompose the marginal likelihood into parts that are know or can be estimated


Notice that the only quantity not readily available in the above equation is . However, this can be estimated by running reduced MCMC simulations for fixed (which can be obtained by the RJMCMC method described in Section 5.1), as follows:


Once these quantities are estimated (see 25, 26, 27, 28), plug them in Equation (22), together with the other known quantities, to obtain the marginal likelihood for the model with fixed number of components .

For higher accuracy of results, it is suggested to compare marginal likelihood with different at points of high density in the posterior distribution of . We will use the estimated highest posterior density values.

Estimation of

Suppose we want to estimate , for . We partition the parameter space into two subsets, namely and , where parameters belonging to are fixed (known or already selected high density values).

First, produce a reduced chain of length to obtain , the highest density value for , using the sampling algorithm in Section 4.3, applied to the non-fixed set of parameters only. Define , the set of known (fixed) parameters with the addition of . From a second reduced chain of length , simulate , as well as new observations from the proposal density in Equation 10, centred in .

Now, let and denote acceptance probabilities respectively of the first and second chain. We can finally estimate the value of the posterior density at as


Repeat this procedure for all and multiply the single densities to obtain


Note that there are no requirements on what and should be, granted the first chain is long enough to have reached the stationary distribution.

Estimation of

Run a reduced chain of length . At each iteration, draw observations , , , . Set , the parameter vector of highest posterior density. The posterior density at can be estimated as


Estimation of

Run a reduced chain of length . At each iteration, draw observations , , . Set , the parameter vector of highest posterior density. Posterior density at can be estimated as


Estimation of

Run a reduced chain of length . At each iteration, draw observations . Set , the parameter vector of highest posterior density. Posterior density at can be estimated as


4 Application

For comparative and demonstrative purposes, we show applications of our method using two simulated datasets from (A)

and (B)

respectively with and observations. Process (A) is similar to the one considered by Hossain (2012) and Wong and Li (2000), while (B) was chosen to illustrate in practice how labels switch is dealt with. The issue of labels switch for (B) can be seen in Figure 3, where we show the raw MCMC output with signs of label switch between components 2 and 3 (green and red lines), and the relabelled output after applying the algorithm.

Figure 1: Simulated series from (A) (top) and (B) (bottom).

The algorithm then proceeds as described in Algorithm 1 below:

1:for  do
Algorithm 1
Model (A) Preference Marg. log-lik
Model (B) Preference Marg. log-lik
Table 1: Results from simulation studies. “Preference” is the proportion of times the model was retained against all models with same number of components.

As we can see from Tables, 1, 2 and 3, and Figures 2 and 4, the “true” model is chosen in both cases, as it has the largest marginal log-likelihood. In addition, true values of the parameters are found in high density regions of their respective posterior distributions.

Model A True Value Posterior Mean Standard Error 90% HPDR
0 0.011 0.0268 (-0.032, 0.055)
0 -0.183 3.273 (-5.672, 5.206)
-0.5 -0.449 0.037 (-0.511, -0.389)
1 0.994 0.079 (0.869, 1.136)
1 0.992 0.079 (0.862, 1.119)
2 2.069 0.149 (1.825, 2.311)
0.5 0.571 0.046 (0.494, 0.647)
Table 2: Results of simulation from posterior distribution of the parameters under model (A).
Figure 2: Trace and density plots of selected model from (A). Sample size is 100000, after discarding 50000 observations as burn-in period.
Figure 3: Comparison of raw output (left) and output adjusted for labels switch of mixing weights from (B). We notice the effectiveness of the relabelling algorithm applied to our MCMC.
Model B True Value Posterior Mean Standard Error 90% HPDR
0 0.001 0.018 (-0.009, 0.007)
0 0.005 0.253 (-0.078, 0.091)
0 0.102 2.133 (-3.145, 3.405)
-0.5 -0.483 0.038 (-0.536, -0.427)
0.5 0.498 0.034 (0.450, 0.547)
-0.4 -0.461 0.105 (-0.596, -0.327)
1 0.731 0.264 (0.432, 1.058)
1 1.035 0.246 (0.804, 1.156)
2 2.035 0.439 (1.625, 2.522)
4 4.074 0.341 (3.559, 4.573)
0.5 0.495 0.056 (0.411, 0.568)
0.3 0.293 0.064 (0.207, 0.395)
0.2 0.212 0.041 (0.148, 0.275)
Table 3: Results of simulation from posterior distribution of the parameters under model (B).
Figure 4: Trace and density plots of parameters from (B). Sample size is 100000, after discarding 50000 observations as burn-in period.

To show consistency of the method, the experiment on model (A) was replicated several times. Details on that are available in the Appendix.

4.1 The IBM common stock closing prices

The IBM common stock closing prices (Box and Jenkins, 1976) is a financial time series widely explored several times in the literature (see, for instance Wong and Li, 2000). It contains 369 observations from May to November .

Figure 5: Times series of IBM closing prices (top) and series of the first order differences (bottom)

Following previous studies, we consider the series of first order differences. To allow direct comparison with Wong and Li (2000) and Hossain (2012), we set .

With the procedure outlined in Algorithm 1 our method chooses a to best fit the data, amongst all 2, 3, and 4 component models of maximum order , with a marginal log-likelihood of . We immediately notice that this is different from the selected model in Wong and Li (2000). Such difference may occur as the frequentist approach fails to capture the multimodality in the distribution of certain parameters, which we can clearly see from Figure 6. In fact, by attempting to fit a model by EM-Algorithm from several different starting points, we concluded that this would actually provide a better fit than the chosen by Wong and Li.

Figure 6: Posterior distributions of Autoregressive parameters from selected model , with HPDR highlighted. We can clearly see multimodality occurring for certain parameters. Sample of simulated values post burn-in.

4.2 The Canadian lynx data

Another dataset widely explored in time series literature, and in our interest by Wong and Li (2000), is the annual record of Canadian lynx trapped in the Mackenzie River district in Canada between 1821 and 1934. This dataset, listed by Elton and Nicholson (1942), includes observations.

Following previous studies, we consider the natural logarithm of the data, which presents a typical autoregressive correlation structure with years cycles. We notice the presence of multimodality in the log-data, with two local maxima (see Figure 7). This suggest that the series may be in fact generated by a mixture of two components.

Figure 7: Original time series of Canadian lynx (top left), series of natural logarithms (top right), histogram of log-data (bottom left) and autocorrelation plot of log-data (vottom right). The data presents a typical autoregressive correlation structure, as well as multimodality.

In their analysis, Wong and Li (2000) choose a as best model to fit the data. However, their choice was based on the minimum criterion, which has been acknowledged for not always being reliable for MAR models, particularly with small datasets.

Aiming to have a better insight about the data, we apply our Bayesian method. The selected model is in this case a , preferred over a by the algorithm, and to all and component models with autoregressive order . The marginal log-likelihood of the model is .

Figure 8: Posterior trace plots and density of selected model for the natural logarithm of Canadian lynx data. For all parameters, the credibility region contains the estimated values from Wong and Li (2000). Sample size is , after burn-in iterations.

We generated a sample of size from the posterior distribution of the parameters of the selected model. It is noticed that, for most paramters, the credibility region includes the MLEs obtained by Wong and Li (2000). The only exception stands for the scale parameters, which seem to be slightly larger than such MLEs. However, this may be due to our model containing one fewer AR parameter. On the other hand, these results are in line with the estimates obtained by fitting a using the EM algorithm, since all estimates are well within the corresponding highest posterior density region.

Parameter MLE HD value Standard Error 90% HPDR
0.4957 0.4962 1.6897 (-1.2599, 3.4341)
2.5728 1.6945 1.2663 (-0.0138, 3.8897)
0.9901 1.0779 0.0667 (0.9893 1.1320)
1.5042 1.7205 0.1594 (1.4717, 1.9866)
-0.8984 -0.7966 0.1528 (-1.0578, -0.5604)
0.2313 0.3553 0.1846 (0.2162, 0.6451)
0.4828 0.6010 0.1006 (0.4933, 0.7478)
0.2358 0.3280 0.1247 (0.1536, 0.5555)
Table 4: Summary statistics of sample of size from posterior distributions of the paramters of the selected model for the log-lynx data.

5 Bayesian density forecasts with mixture autoregressive models

Once a sample from the posterior is obtained, it is useful to use these to make predictions on future (or off-set) observations.

Wong and Li (2000) and Boshnakov (2009) respectively introduced a simulation based and an analytical method for for density forecasts assuming a MAR model. The first method relies on Monte Carlo simulations, while the second derives exact h-step ahead predictive distributions of a given observation.

On one hand, we could estimate density forecasts using the highest posterior density values (i.e. the peak of the posterior distribution). However, it is better in this case to exploit the entire simulated sample as follows:

  1. Label each simulation from to , e.g. , .

  2. Calculate density forecast .

  3. Estimate the density forecast

In this way, we obtain a sample from the h-steps ahead density forecast of an observation of interest.

We estimate the 1-step and 2-steps predictive distributions of the IBM data at using the analytical method by Boshnakov (2009), and compare them to the ones obtained by EM algorithm. (see Figure 9). The solid red lines represent the density obtained by Boshnakov (2009) using EM estimates and the exact method. Results of our method are represented by the solid black lines, with the dashed lines as credibility region. The figure also shows how quickly the uncertainty on the predictions grows as we move further in the future, with the 2-step predictive density looking much flatter.

We can see that there are no substantial differences in the shape of these predictive distributions. However, we notice that, particularly for the 2-steps predictor, averaging seems to ”stabilise” the density line.

Figure 9: Density of 1 and 2 steps ahead predictor at

for the IBM data. The solid black line represents our Bayesian method, with the 90% credible interval identified by the dashed lines. The solid red line represents the predicted density using parameter values from EM estimation by Wong and Li.

We notice from the plots that, clearly for the 1-step predictor and slighlty for the 2-step predictor, the density obtained by MCMC attaches higher density the observations of interest and .

6 Conclusion

We presented an innovative fully Bayesian analysis of mixture autoregressive models with Gaussian components, in particular a new method for simulation from the posterior distribution of the autoregressive parameters, which covers the whole stationarity region, compared to previous approaches that constrained it in one way or another. Our approach allowed us to better capture presence of multimodality in the posterior distribution of model parameters. We also introduced a way of dealing with label switching that does not interfere with convergence to the posterior distribution of the model parameters. This consisted in using a relabelling algorithm a posteriori.

Simulations indicate that the method works well. We presented results for two simulated data sets. In both cases the “true” model was selected, and posterior distributions showed high densities regions around the “true” values of the parameters.

The ability of our method to explore the complete stationarity region of the autoregressive parameters allows it to capture better multimodality of distributions. This was illustrated with the IBM and the Canadian Lynx datasets. In the former (Figure 6) we saw how multimodality in the posterior distribution of autoregressive parameters was captured, aspects which were missed in the analyses of Hossain (2012), (see for example Figures 3.10 and 3.11). For this example, it was also noticed that modes of posterior distributions of the autoregressive parameters roughly correspond to point estimates obtained by EM estimation. In the latter (Figure 8), we found the mode of to be quite distant from , with values close to lying in the credibility interval. In this case, the risk with Hossain’s method would be to truncate the Normal proposal at points such that a significant part of the stationarity region of the model is not covered. Sampietro’s method would have failed to detect such a mode, since it is outside the interval .

In conclusion, we may say that our algorithm provides accurate and informative estimation, and therefore may result in more accurate predictions.

Further work could be done to improve the efficiency of our method. Possible improvements to the method include a different algorithm for sampling of autoregressive parameters.

In particular, acceptance rates for the Random Walk Metropolis moves used for sampling the autoregressive parameters can be rather low for mixtures of large number of components or for components with large autoregressive orders, making the algorithm slow at times, with the added risk of it not being able to explore the complete parameter space efficiently. A different procedure, such as the Metropolis Adjusted Langevin Algorithm (MALA), may be considered to improve the efficiency.

Gaussian mixtures are very flexible but alternatives are worth considering. In particular, components with standardised t-distribution could allow modelling heavier tails with small number of components.


We explain here how consistency of the method was assessed, with application to data generated from Model (A) in Section 4.

For this experiment, we simulate different datasets of length from the underlying MAR process in Model (A), and proceeded as follows:

  1. For each dataset, we simulate a sample of size from the posterior distribution of the parameters, after allowing iterations as burn-in period.

  2. For each parameter, we find the overall minimum and maximum over the samples, say and . From here, we identify a grid of equally spaced values in the range , and evaluate the density of such points under each posterior.

  3. Finally, we average for each of the points to obtain a unique average density.

The figure below summarises results of applying this procedure. As we can see, the densities are well in line with the true values of the parameters.

Figure 10: Average densities of the parameters over simulated datasets of length . Each simulation is a sample of size from the posterior distribution of the parameters.


  • G. N. Boshnakov (2009) Analytic expressions for predictive distributions in mixture autoregressive models.. Stat. Probab. Lett. 79 (15), pp. 1704–1709. External Links: Document Cited by: §2, §5, §5.
  • G. N. Boshnakov (2011) On first and second order stationarity of random coefficient models. Linear Algebra Appl. 434 (2), pp. 415–423. External Links: Document Cited by: §1, §2.1.
  • G. E. P. Box and G. M. Jenkins (1976) Time series analysis : forecasting and control / george e.p. box and gwilym m. jenkins. Book, Rev. ed. edition, Holden-Day San Francisco (English). External Links: ISBN 0816211043 Cited by: §4.1.
  • G. Celeux (2000) Bayesian Inference of Mixture: The Label Switching Problem.. Payne R., Green P. (eds) COMPSTAT. Physica, Heidelberg. Cited by: §1, §3.4.
  • S. Chib and I. Jeliazkov (2001) Marginal likelihood from the Metropolis-Hastings output.. J. A. Stat. Ass. 96 (453), pp. 270–281. Cited by: §1, §3.6.
  • S. Chib (1995) Marginal likelihood from the Gibbs output.. J. A. Stat. Ass. 90 (432), pp. 1313–1321. Cited by: §1, §3.6.
  • A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pp. 1–38. Cited by: §1, §3.1, §3.1.
  • J. Diebolt and C. P. Robert (1994) Estimation of finite mixture distributions through bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological) 56, pp. 363–375. Cited by: §3.1.
  • C. Elton and M. Nicholson (1942) The ten-year cycle in numbers of the lynx in canada. Journal of Animal Ecology 11 (2), pp. 215–244. External Links: ISSN 00218790, 13652656, Link Cited by: §4.2.
  • P. J. Green (1995)

    Reversible jump markov chain monte carlo computation and bayesian model determination

    Biometrika 82 (4), pp. 711–732. Cited by: §1, §3.5.
  • A.B.M. S. Hossain (2012) Complete Bayesian analysis of some mixture time series models. Ph.D. Thesis, Probability and Statistics Group, School of Mathematics, University of Manchester. Cited by: