Markovian inference methods allow for on-line processing of data with time and memory requirements at each time step that are constant in the total number of observations received. Each observation $y_t$ corresponds to a latent variable $x_t$, which over time forms a Markov chain, giving rise to the latent state space model. At every time step a new model $p(x_t \mid y_{1:t})$ is inferred using Bayes' theorem, based on the previous state of the model and the current observation $y_t$:

$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1} \tag{1}$$

where the probability distribution (or density) $p(x_t \mid x_{t-1})$ describes a state transition function responsible for the evolution of the latent state $x_t$. However, the fixed computational cost of such inference depends on assuming conditional independence between current and past states of the latent variable, so that:

$$p(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1) = p(x_t \mid x_{t-1}) \tag{2}$$
In contrast, in many human learning phenomena the model of the environment seems to be inferred across multiple past states, weighed with respect to their recency (see [2] for a review). If we want to weigh the effect of each past state separately on the current state, then the Markov property (Eq 2) does not hold and we need another way of defining the evolution of the latent state. To solve this problem we use the mixture state transition function approach for high-order Markov chains [3, 4]. This approach represents the conditional probability distribution $p(x_t \mid x_{t-1}, \ldots, x_1)$ as a mixture of past states:

$$p(x_t \mid x_{t-1}, \ldots, x_1) = \sum_{i=1}^{t-1} w_i\, p(x_t \mid x_{t-i}) \tag{3}$$

where $w_i$ is a mixing coefficient, so that $w_i \geq 0$ and $\sum_{i=1}^{t-1} w_i = 1$.
Mixing coefficients make the dependence of the future on the past explicit by quantifying the decay in dependence as the future moves farther from the past [5]. If we use an equal mixing coefficient for all previous states, $w_i = 1/(t-1)$, then all past states, independent of their lag, contribute equally to the prediction of the current state. Instead, we might want the more recent states represented in proportion to their recency. Hence, we can assume that the contribution of past states, reflected by the mixing coefficient $w_i$, declines over time steps as given by some decay function and a rate of decay parameter $\gamma$. Here we choose an exponential decay function, so that:

$$w_i = \frac{1}{Z}\,(1-\gamma)^{i-1} \tag{4}$$

where $\gamma$ is the rate of decrease ($0 < \gamma < 1$) and $Z = \sum_{i=1}^{t-1}(1-\gamma)^{i-1}$ is a normalising constant. Substituting into Eq 3 gives:

$$p(x_t \mid x_{t-1}, \ldots, x_1) = \frac{1}{Z}\sum_{i=1}^{t-1} (1-\gamma)^{i-1}\, p(x_t \mid x_{t-i}) \tag{5}$$
As a result, we have a decaying time window into the past states, defined by the rate parameter $\gamma$ of the decreasing mixing coefficient $w_i$. Fig 1 illustrates the relationship between the $\gamma$ and $w_i$ parameters: the bigger the $\gamma$, the faster the utility of past states decreases and the greater the contribution of the most recent states to the mixture distribution (Eq 5). As $\gamma$ approaches 1, the mixture begins to resemble and approximate a first-order Markov chain:

$$p(x_t \mid x_{t-1}, \ldots, x_1) \approx p(x_t \mid x_{t-1})$$

Conversely, as $\gamma$ approaches 0, the mixing coefficient does not decay across time steps and all past states contribute equally to the mixture. Intuitively, $\gamma$ can be interpreted as a bias towards more recent states.
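The limiting behaviour of the mixing coefficients can be checked numerically. The following Python sketch (illustrative only; the helper name `mixing_weights` is ours, not the paper's) computes the normalised exponentially decaying weights of Eq 4:

```python
import numpy as np

def mixing_weights(t, gamma):
    """Exponentially decaying mixing coefficients w_i for lags
    i = 1..t-1, normalised to sum to 1 (Eq 4)."""
    lags = np.arange(1, t)
    w = (1.0 - gamma) ** (lags - 1)
    return w / w.sum()

# Large gamma: mass concentrates on the most recent state
# (approaching the first-order Markov limit).
print(mixing_weights(6, 0.9))
# Small gamma: weights become nearly uniform over all past states.
print(mixing_weights(6, 0.01))
```

With $\gamma = 0.9$ the first weight is roughly 0.9 and the rest are negligible; with $\gamma = 0.01$ all five weights are close to 0.2.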
However, such an approach poses two fundamental problems. First, to evaluate Eq 5 we would still need to explicitly store all past states, and the processing time of the algorithm would increase with $t$. Second, such sequentially estimated mixtures are themselves mixtures of previous mixture distributions, making the estimation analytically intractable. Therefore, such a model would not allow for processing of data with time and memory requirements that are constant at each time step.
Next, we show how a practical solution to both of these problems can be obtained by using sampling methods.
2 State estimation with mixture sampling
In order to limit the computational cost of performing inference at every time step, we represent the distribution of the latent variable with a fixed number of samples $N$. As a result, the proportion of samples assigned to a particular component of the mixture distribution (representing a past state) is determined by the mixing coefficient $w_i$:

$$N_i = \left[\, w_i N \,\right], \qquad S_i = \{\, x^{(j)} \sim p(x_t \mid x_{t-i}) \,\}_{j=1}^{N_i} \tag{6}$$

where $N$ is the total number of samples, $N_i$ the number of samples allocated to the component $p(x_t \mid x_{t-i})$, $S_i$ the set of samples drawn from that distribution, and $[\cdot]$ denotes rounding to the nearest integer.
For example, if we use $N = 100$ samples to represent a mixture of five components with a fixed $w_i = 0.2$, then every component would be assigned 20 samples. With a decaying mixing coefficient (Eq 4), the number of samples assigned to any $p(x_t \mid x_{t-i})$ decreases with the lag $i$, but the number of samples assigned to any particular component will remain constant over time (because of the constant rate parameter $\gamma$ in Eq 5; also see Fig 1). For example, if $N = 100$ and $\gamma = 0.5$, then $w_1 = 0.5$ and $w_2 = 0.25$, so $p(x_t \mid x_{t-1})$ will be represented with 50 samples, $p(x_t \mid x_{t-2})$ with 25 samples, and so forth.
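The allocation in Eq 6 can be sketched as follows (a minimal illustration; `allocate_samples` is our hypothetical helper, not part of the paper):

```python
import numpy as np

def allocate_samples(n_samples, gamma, t):
    """Number of samples N_i assigned to each mixture component
    p(x_t | x_{t-i}), i = 1..t-1, by rounding w_i * N (Eq 6)."""
    lags = np.arange(1, t)
    w = (1.0 - gamma) ** (lags - 1)
    w /= w.sum()
    return np.rint(w * n_samples).astype(int)

# With N = 100 and gamma = 0.5 the most recent component gets
# about 50 samples, the next about 25, and so on.
print(allocate_samples(100, 0.5, 8))
```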
The property of a constant number of samples for every $i$-th component of the mixture at any time step is important, since it greatly simplifies the approximation of the mixture distribution (Eq 5). If at every time step we choose a fixed proportion $\gamma$ of samples from the existing mixture distribution and reassign those samples to represent $p(x_t \mid x_{t-1})$, then after $t$ steps we end up with samples allocated across mixture components as given by Eq 5. To make this explicit, consider the following evolution of the mixture (Eq 5) symbolically in a table:
| $t$ | Mixture representation (Eq 3) | Sample representation |
| --- | --- | --- |
| 2 | $p(x_2 \mid x_1)$ | $N$ samples from $p(x_2 \mid x_1)$ |
| 3 | $\gamma\, p(x_3 \mid x_2) + (1-\gamma)\, p(x_3 \mid x_1)$ | $\gamma N$ samples re-drawn from $p(x_3 \mid x_2)$; $(1-\gamma)N$ retained |
| 4 | $\gamma\, p(x_4 \mid x_3) + \gamma(1-\gamma)\, p(x_4 \mid x_2) + (1-\gamma)^2\, p(x_4 \mid x_1)$ | $\gamma N$ samples re-drawn from $p(x_4 \mid x_3)$; $(1-\gamma)N$ retained |
It follows that at every time step a mixture distribution of past states can be approximated by sampling from just $p(x_t \mid x_{t-1})$ and the previous state of the mixture. After $t$ steps these $N$ samples represent a mixture of $t-1$ past states. The number of samples assigned to a particular mixture component decreases with its distance from the present state at the rate of $\gamma$. This algorithm has constant time and memory requirements and uses only two parameters: the re-sampling coefficient $\gamma$ and the total number of samples $N$.
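The constant-cost update above can be sketched in a few lines of Python. This is an editorial illustration under an added assumption: drawing from $p(x_t \mid x_{t-1})$ is modelled here as jittering the most recent state value with Gaussian noise, which is our choice for the demo, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def step_mixture(samples, new_state, gamma, noise_sd=0.1):
    """One constant-cost mixture update: replace a fixed proportion
    gamma of randomly chosen samples with draws representing the most
    recent state; the remainder carry the older mixture forward."""
    n = len(samples)
    k = int(round(gamma * n))
    idx = rng.choice(n, size=k, replace=False)
    samples = samples.copy()
    # Assumed form of p(x_t | x_{t-1}): recent state value plus noise.
    samples[idx] = new_state + rng.normal(0.0, noise_sd, size=k)
    return samples

samples = rng.normal(0.0, 0.1, size=1000)   # N samples for the state at t = 1
for state in [1.0, 2.0, 3.0]:               # states observed at t = 2, 3, 4
    samples = step_mixture(samples, state, gamma=0.5)
# Roughly half of the samples now represent the most recent state (3.0),
# a quarter the state before it (2.0), and so on (Eq 5).
```

Only the current sample set is stored, so the cost per step does not grow with $t$.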
3 Posterior estimation
So far we have dealt only with high-order chains of latent variables. However, our goal is to incrementally infer the model of the environment every time a new observation $y_t$ arrives. In other words, we are interested in inferring the posterior probability of the latent variable:

$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t)\, p(x_t \mid x_{t-1}, \ldots, x_1) \tag{7}$$
We assume that the evolution of the latent variable is predicted by the mixture state transition function (Eq 5) and system noise $\epsilon_t$:

$$x_t = x_{t-i} + \epsilon_t \tag{8}$$

where the lag $i$ is selected with probability $w_i$ (Eq 4).
Having previously derived an approximation for the mixture of past states, sampling from the posterior distribution becomes straightforward. Here we use a sequential Monte Carlo approach called particle filtering [6, 7], where the estimation of the posterior distribution is based on generating proposals of the latent state with Algorithm 1 and weighing them with the likelihood function $p(y_t \mid x_t)$. The process can be summarised as follows.
At every time step a set of $N$ particles represents the mixture of past states (Eq 6). In other words, each individual particle is a prediction of the latent state at the current time step. Next, we calculate the likelihood of the current observation under each particle; this likelihood serves as the particle's importance weight. We then normalise the importance weights across particles. The set of particles together with the corresponding weights represents the posterior distribution $p(x_t \mid y_{1:t})$. Next, we create a new predictive distribution for the next state by randomly replacing a proportion $\gamma$ of the particles with samples from the posterior. Finally, we add system noise to the particles. This process is described fully in Algorithm 2 below.
This algorithm is based on the widely used method of importance sampling: if we change the re-sampling step so that the predictive prior is re-sampled solely from the posterior distribution by setting $\gamma = 1$, then the algorithm becomes equivalent to the standard importance resampling implementation of a first-order Markov chain of latent states, where the system transition function is the identity and the system noise is zero (in other words, the posterior distribution of the current state serves as the predictive prior for the next state; Fig 2, column 1).
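The filtering step described above can be sketched as follows. This is an illustrative implementation under added assumptions: a Gaussian likelihood $p(y \mid x)$ and Gaussian system noise, both of which are our choices for the demo rather than the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def filter_step(particles, y, gamma, obs_sd=0.5, sys_sd=0.05):
    """One particle-filter step with recency-weighted re-sampling:
    weight the predictive particles by the likelihood of y, replace a
    proportion gamma of them with draws from the posterior, then add
    system noise (identity transition)."""
    n = len(particles)
    # Importance weights from an assumed Gaussian likelihood p(y | x).
    w = np.exp(-0.5 * ((y - particles) / obs_sd) ** 2)
    w /= w.sum()
    # Posterior draws: resample particles in proportion to their weights.
    posterior = rng.choice(particles, size=n, replace=True, p=w)
    # Replace gamma*N randomly chosen particles with posterior samples;
    # the remainder keep the older mixture of past states.
    k = int(round(gamma * n))
    idx = rng.choice(n, size=k, replace=False)
    new_particles = particles.copy()
    new_particles[idx] = posterior[:k]
    # Identity transition plus system noise.
    return new_particles + rng.normal(0.0, sys_sd, size=n)

particles = rng.normal(0.0, 1.0, size=2000)  # predictive prior at t = 1
for y in [0.8, 1.0, 0.9, 1.1, 1.0]:          # a stream of observations
    particles = filter_step(particles, y, gamma=0.5)
print(np.mean(particles))  # the posterior mean drifts towards ~1.0
```

Because only a proportion $\gamma$ of the particles is refreshed each step, the particle set retains a recency-weighted memory of earlier posteriors; setting $\gamma = 1$ recovers standard importance resampling.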
In many human incremental learning situations the model of the environment is inferred across multiple past states, weighed with respect to their recency. Here we presented a probabilistic model of incremental learning in which the current latent state, given some observation, depends on recency-weighted past states. Although our model approximates high-order dependencies, the posterior estimation algorithm we described is itself first-order Markovian and can hence be processed with fixed time and memory costs.
- Bishop  Christopher M. Bishop. 13 Sequential Data. In Pattern Recognition and Machine Learning, pages 605–652. Springer, 2006.
- Kiyonaga et al.  Anastasia Kiyonaga, Jason M. Scimeca, Daniel P. Bliss, and David Whitney. Serial Dependence across Perception, Attention, and Memory. Trends in Cognitive Sciences, 21(7):493–497, 2017. ISSN 1879307X. doi: 10.1016/j.tics.2017.04.011. URL http://dx.doi.org/10.1016/j.tics.2017.04.011.
- Berchtold and Raftery  A Berchtold and A Raftery. The Mixture Transition Distribution Model for High-Order Markov Chains and Non-Gaussian Time Series. Statistical Science, 17(3):328–356, 2002. ISSN 0883-4237. doi: 10.1214/ss/1042727943.
- Saul and Jordan  Lawrence K. Saul and Michael I. Jordan. Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37(1):75–87, 1999. ISSN 08856125. doi: 10.1023/A:1007649326333.
- McDonald et al.  Daniel J. McDonald, Cosma Rohilla Shalizi, and Mark Schervish. Estimating beta-mixing coefficients. In International Conference on Artificial Intelligence and Statistics, pages 516–524, mar 2011. URL http://arxiv.org/abs/1103.0941.
- Cappe et al.  O Cappe, S Godsill, and E Moulines. An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo. Proceedings of the IEEE, 95(5):899–924, may 2007. ISSN 0018-9219. doi: 10.1109/JPROC.2007.893250.
- Doucet et al.  A Doucet, N de Freitas, and N Gordon. An Introduction to Sequential Monte Carlo methods. In A Doucet, N de Freitas, and N Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer, NY, 2001.
- Rubin  Donald B. Rubin. The Calculation of Posterior Distributions by Data Augmentation: Comment: A Noniterative Sampling/Importance Resampling Alternative to the Data Augmentation Algorithm for Creating a Few Imputations When Fractions of Missing Information Are Modest: The SIR Algorithm. Journal of the American Statistical Association, 82(398):543, jun 1987. ISSN 01621459. doi: 10.2307/2289460. URL http://www.jstor.org/stable/2289460?origin=crossref.