Recency-weighted Markovian inference

11/08/2017 ∙ by Kristjan Kalm, et al. ∙ University of Cambridge

We describe a Markov latent state space (MLSS) model, where the latent state distribution is a decaying mixture over multiple past states. We present a simple sampling algorithm that allows such high-order MLSS models to be approximated with fixed time and memory costs.




1 Introduction

Markovian inference methods allow for on-line processing of data with time and memory requirements that are constant in the total number of observations received at each time step. Each observation $y_t$ corresponds to a latent variable $x_t$, which over time forms a Markov chain, giving rise to the latent state space model [1]. At every time step a new model is inferred using Bayes' theorem based on the previous state of the model and the current observation $y_t$:

$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1} \quad \text{(Eq 1)}$$

where the probability distribution (or density) $p(x_t \mid x_{t-1})$ describes a state transition function responsible for the evolution of the latent state $x_t$. However, the fixed computational cost of such inference depends on assuming conditional independence between current and past states of the latent variable, so that:

$$p(x_t \mid x_{1:t-1}) = p(x_t \mid x_{t-1}) \quad \text{(Eq 2)}$$
Contrastingly, in many human learning phenomena the model of the environment seems to be inferred across multiple past states, weighted with respect to their recency (see [2] for a review). If we want to weigh the effect of each past state separately on the current state, then the Markov property (Eq 2) does not hold and we need to find another way of defining the evolution of the latent state. To solve this problem we use the mixture state transition function approach for high-order Markov chains [3, 4]. This approach represents the conditional probability distribution $p(x_t \mid x_{t-1}, \dots, x_{t-n})$ as a mixture of past states:

$$p(x_t \mid x_{t-1}, \dots, x_{t-n}) = \sum_{j=1}^{n} w_j\, p(x_t \mid x_{t-j}) \quad \text{(Eq 3)}$$

where $w_j$ is a mixing coefficient so that $\sum_{j=1}^{n} w_j = 1$.

Mixing coefficients make the dependence of the future on the past explicit by quantifying the decay in dependence as the future moves farther from the past [5]. If we use an equal mixing coefficient for all previous states, $w_j = 1/n$, then all past states, independent of their lag, contribute equally to the prediction of the current state. Instead, we might want the more recent states represented proportionally to their recency. Hence, we can assume that the contribution of past states, reflected by the mixing coefficient $w_j$, declines over time steps as given by some decay function and the rate of decay parameter $\gamma$. Here we choose a geometric decay function, so that:

$$w_j = \frac{1}{Z}\,\gamma\,(1-\gamma)^{\,j-1} \quad \text{(Eq 4)}$$

where $\gamma$ is the rate of decrease ($0 < \gamma \le 1$) and $Z$ a normalising constant. Substituting into equation (Eq 3) gives:

$$p(x_t \mid x_{t-1}, \dots, x_{t-n}) = \frac{1}{Z}\sum_{j=1}^{n} \gamma\,(1-\gamma)^{\,j-1}\, p(x_t \mid x_{t-j}) \quad \text{(Eq 5)}$$

As a result we have a decaying time window into the past states defined by the rate parameter $\gamma$ of the decreasing mixing coefficient $w_j$. Fig 1 illustrates the relationship between the $w_j$ and $\gamma$ parameters: the bigger the $\gamma$, the faster the utility of past states decreases, and the greater the contribution of the most recent states to the mixture distribution (Eq 5). As $\gamma$ approaches 1, the mixture begins to resemble and approximate a first-order Markov chain:

$$\lim_{\gamma \to 1}\, p(x_t \mid x_{t-1}, \dots, x_{t-n}) = p(x_t \mid x_{t-1})$$

Fig 1: Values of the mixing coefficient $w_j$ over the past 5 states ($j = 1, \dots, 5$) for different values of $\gamma$. $w_j$ represents the proportion of a past state in the mixture distribution (Eq 5).

Conversely, as $\gamma$ approaches 0, the mixing coefficient does not decay across time steps and all past states contribute equally to the mixture. Intuitively, $\gamma$ could be interpreted as the bias towards more recent states.
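The two limiting behaviours of the mixing coefficients can be checked numerically. A minimal sketch, assuming the geometric decay of Eq 4 with an explicit normalising constant (the function name `mixing_coefficients` is ours, not from the paper):

```python
def mixing_coefficients(gamma, n):
    """Mixing coefficients w_j = gamma * (1 - gamma)**(j - 1) / Z
    for lags j = 1..n, normalised so that they sum to 1 (Eq 4)."""
    raw = [gamma * (1 - gamma) ** (j - 1) for j in range(1, n + 1)]
    z = sum(raw)  # normalising constant Z for the finite window
    return [w / z for w in raw]

# gamma close to 1: almost all weight on the most recent state (first-order Markov)
print(mixing_coefficients(0.99, 5))
# gamma close to 0: the weights approach the uniform 1/n mixture
print(mixing_coefficients(0.01, 5))
```

Each call returns the proportions in which the past states at lags 1 to n contribute to the prediction of the current state.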

However, such an approach poses two fundamental problems. First, to evaluate Eq 5 we still need to explicitly store all $n$ past states, and the processing time of the algorithm would increase with $n$. Second, such sequentially estimated mixtures are themselves mixtures of previous mixture distributions, making the estimation analytically intractable. Therefore, such a model would not allow for processing of data with time and memory requirements that are constant at each time step.

Next, we show how a practical solution to both of these problems can be obtained by using sampling methods.

2 State estimation with mixture sampling

In order to limit the computational cost of performing inference at every time step we represent the distribution of the latent variable with a fixed number of samples $N$. As a result, the proportion of samples assigned to a particular component of the mixture distribution (representing a past state) is determined by the mixing coefficient $w_j$:

$$|S_j| = \mathrm{round}(w_j N) \quad \text{(Eq 6)}$$

where $N$ is the total number of samples, $S_j$ the subset of samples allocated to the $j$-th mixture component $p(x_t \mid x_{t-j})$, i.e. a set of samples from that distribution, and $w_j N$ is rounded to the nearest integer.

For example, if we use $N = 100$ samples to represent a mixture of $n = 5$ components with a fixed $w_j = 1/n$, then every component would be assigned 20 samples. With a decaying mixing coefficient (Eq 4) the number of samples assigned to any $x_{t-j}$ decreases with the lag $j$, but the number of samples assigned to any particular lag will remain constant over time (because of the constant rate parameter $\gamma$ in Eq 5; also see Fig 1). For example, if $N = 100$ and $\gamma = 0.5$, then $p(x_t \mid x_{t-1})$ will be represented with 50 samples, $p(x_t \mid x_{t-2})$ with 25 samples, and so forth.
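The allocation of Eq 6 can be sketched as follows, assuming the unnormalised geometric weights of Eq 4 (the helper name `sample_allocation` is ours):

```python
def sample_allocation(gamma, n, n_samples):
    """Number of samples assigned to each of the n mixture components
    (Eq 6): |S_j| = round(w_j * N) with w_j = gamma * (1 - gamma)**(j - 1)."""
    return [round(gamma * (1 - gamma) ** (j - 1) * n_samples)
            for j in range(1, n + 1)]

print(sample_allocation(0.5, 5, 100))  # -> [50, 25, 12, 6, 3]
```

Note that Python's `round` resolves ties to the nearest even integer, so the lag-3 component receives 12 rather than 13 samples here; any consistent rounding rule serves the purpose of Eq 6.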

The property of a constant number of samples for every $j$-th component of the mixture at any time is important, since it greatly simplifies the approximation of the mixture distribution (Eq 5). If at every time-step we choose a fixed proportion $\gamma$ of samples from the existing mixture distribution and reassign those samples to represent $p(x_t \mid x_{t-1})$, then after $n$ steps we end up with samples allocated across the mixture components as given by (Eq 5). To make this explicit, consider the following evolution of the mixture (Eq 5) symbolically in a table:

Mixture representation (Eq 3) | Sample representation
$\pi(x)$ | $S^1 = \{s \sim \pi(x)\}$
$\gamma\,p(x_2 \mid x_1) + (1-\gamma)\,\pi(x)$ | $S^2$: $\gamma N$ samples $\sim p(x_2 \mid x_1)$, $(1-\gamma)N$ samples from $S^1$
$\gamma\,p(x_3 \mid x_2) + (1-\gamma)\bigl[\gamma\,p(x_2 \mid x_1) + (1-\gamma)\,\pi(x)\bigr]$ | $S^3$: $\gamma N$ samples $\sim p(x_3 \mid x_2)$, $(1-\gamma)N$ samples from $S^2$
Table 1: The evolution of the latent state via the mixture state transition function. The left column shows the evolution of the mixture representation, the right column shows the same as a sample representation. Sample representation: at every time-step we choose a fixed proportion $\gamma$ of samples and reassign those samples to represent the most recent mixture component. Here $\pi(x)$ is the initial distribution, $S^t = \{s^{(1)}, \dots, s^{(N)}\}$ is a set of samples representing the mixture distribution at time $t$, and $s \sim p$ denotes a set of samples from $p$.

It follows that at every time step a mixture distribution of $n$ past states can be approximated by sampling from just $p(x_t \mid x_{t-1})$ and the previous state of the mixture $S^{t-1}$.
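That this recursion really produces the geometric allocation can be verified by bookkeeping alone: track, for each of the $N$ samples, the lag of the mixture component it currently represents (a toy verification sketch, not the actual sampler; the helper name `lag_counts` is ours):

```python
import random
from collections import Counter

def lag_counts(n_samples, gamma, T):
    """At every step, gamma*N randomly chosen samples are reassigned to the
    newest component (lag 1); all surviving lag labels age by one. After T
    steps the lag-j count should approximate gamma * (1-gamma)**(j-1) * N."""
    lags = [None] * n_samples  # None marks samples still from the initial pi(x)
    for _ in range(T):
        lags = [lag + 1 if lag is not None else None for lag in lags]
        for i in random.sample(range(n_samples), round(gamma * n_samples)):
            lags[i] = 1  # reassigned to p(x_t | x_{t-1})
    return Counter(lags)

random.seed(0)
counts = lag_counts(10000, 0.5, 20)
print(counts[1], counts[2], counts[3])  # roughly 5000, 2500, 1250
```

The lag-1 count is exact (exactly $\gamma N$ samples are reassigned on the last step), while deeper lags fluctuate around their expected geometric proportions.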


Algorithm 1 State estimation with mixture sampling
Initialise $S^1$ by sampling $N$ samples from some distribution $\pi(x)$.
repeat
     $t = t + 1$
     Update $S^t$ by taking $\gamma N$ samples from $p(x_t \mid x_{t-1})$
                                                                          and $(1-\gamma)N$ samples from $S^{t-1}$
until $t = T$

After $n$ steps these samples represent a mixture of $n$ past states. The number of samples assigned to a particular mixture component decreases with its distance from the present state at the rate of $\gamma$. This algorithm has constant time and memory requirements and uses only two parameters: the re-sampling coefficient $\gamma$ and the total number of samples $N$.
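Algorithm 1 might be rendered in Python as follows (our own minimal sketch; the Gaussian random-walk transition and standard-normal prior are illustrative assumptions, not part of the paper):

```python
import random

def mixture_sampling(n_samples, gamma, T, transition, prior):
    """Algorithm 1: maintain N samples of the decaying mixture over past
    states with constant time and memory cost per step."""
    samples = [prior() for _ in range(n_samples)]  # S^1 ~ pi(x)
    for t in range(2, T + 1):
        k = round(gamma * n_samples)               # gamma*N samples to reassign
        for i in random.sample(range(n_samples), k):
            samples[i] = transition(samples[i])    # draw from p(x_t | x_{t-1})
        # the remaining (1-gamma)*N samples are kept unchanged from S^{t-1}
    return samples

# toy usage: Gaussian random-walk transition, standard-normal prior
final = mixture_sampling(
    n_samples=1000, gamma=0.5, T=20,
    transition=lambda x: x + random.gauss(0.0, 0.1),
    prior=lambda: random.gauss(0.0, 1.0),
)
print(len(final))  # constant number of samples: 1000
```

Per step the cost is fixed at $\gamma N$ transition draws, regardless of how many past states the mixture implicitly spans.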

3 Posterior estimation

So far we have just dealt with high-order chains of latent variables $x_{1:t}$. However, our goal is to incrementally infer the model of the environment every time a new observation $y_t$ arrives. In other words, we are interested in inferring the posterior probability of the latent variable:

$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t)\, p(x_t \mid x_{t-1}, \dots, x_{t-n}) \quad \text{(Eq 7)}$$

We assume that the evolution of the latent variable is predicted by the mixture state transition function (Eq 5) and system noise $v$:

$$x_t = \hat{x}_t + v, \qquad \hat{x}_t \sim p(x_t \mid x_{t-1}, \dots, x_{t-n})$$
Having previously derived an approximation for the mixture state transition function (Algorithm 1), sampling from the posterior distribution becomes straightforward. Here we use a sequential Monte Carlo approach called particle filtering [6, 7], where the estimation of the posterior distribution is based on generating proposals of the latent state with Algorithm 1 and weighing them with the likelihood function $p(y_t \mid x_t)$. The process can be summarised as follows.

At every time-step a set of particles $S^t = \{x_t^{(1)}, \dots, x_t^{(N)}\}$ represents the mixture of past states (Eq 6). In other words, each individual particle is a prediction of the latent state at the current time step. Next we calculate the likelihood of the current observation under each particle, $p(y_t \mid x_t^{(i)})$. This likelihood serves as the importance weight $\omega^{(i)}$. We then normalise the importance weights across particles. The set of particles together with the corresponding weights represents the posterior distribution $p(x_t \mid y_{1:t})$. Next, we create a new predictive distribution for the next state, represented by $S^{t+1}$, by replacing $\gamma N$ randomly chosen particles in $S^t$ with samples from the posterior. Finally we add system noise to the particles. This process is described fully in Algorithm 2 below.

This algorithm is based on a widely used method of importance sampling: if we change the re-sampling step so that the predictive prior is re-sampled solely from the posterior distribution by setting $\gamma = 1$, then the algorithm becomes equivalent to the standard importance re-sampling implementation of a first-order Markov chain of latent states [8], where the system transition function is identity and the system noise is zero (in other words, the posterior distribution of the current state serves as the predictive prior for the next state; Fig 2, column 1).


Algorithm 2 Importance sampling
Initialise $N$ particles $S^1$ by sampling from some distribution $\pi(x)$.
Use $S^1$ as a prediction for $x_1$.
repeat
     $t = t + 1$
     Assign weights to samples: $\omega^{(i)} = p(y_t \mid x_t^{(i)})$
     Normalise weights: $\omega^{(i)} \leftarrow \omega^{(i)} / \sum_k \omega^{(k)}$
     Randomly select $\gamma N$ particle indices
     Assign selected particles to samples from the posterior $p(x_t \mid y_{1:t})$
     New predictive latent distribution $S^{t+1}$ is a mixture of the current posterior and the previous mixture $S^t$
     Add system noise $v$ to the particles
until $t = T$
Fig 2: Components of the state mixture distribution for different values of the $\gamma$ parameter (horizontal axis). When $\gamma = 1$ the predictive prior for the next state consists entirely of samples from the last state. As $\gamma$ approaches zero the mixture distribution will contain a roughly equal number of samples from all of the past states.
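Algorithm 2 can be sketched as a standard particle filter with the modified re-sampling step. The following is our own minimal rendering, assuming a one-dimensional state, a Gaussian likelihood, and Gaussian system noise; the function name and parameters are illustrative, not from the paper:

```python
import math
import random

def recency_weighted_filter(ys, n_particles=1000, gamma=0.5,
                            obs_sigma=1.0, noise_sigma=0.1):
    """Algorithm 2: only a fraction gamma of the particles is re-sampled from
    the posterior at each step, so the predictive prior remains a
    recency-weighted mixture of past states."""
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]  # pi(x)
    estimates = []
    for y in ys:
        # importance weights: Gaussian likelihood p(y_t | x_t^(i))
        w = [math.exp(-0.5 * ((y - x) / obs_sigma) ** 2) for x in particles]
        total = sum(w)
        w = [wi / total for wi in w]                 # normalise the weights
        estimates.append(sum(wi * x for wi, x in zip(w, particles)))
        # re-sample gamma*N randomly chosen particles from the posterior
        idx = random.sample(range(n_particles), round(gamma * n_particles))
        posterior = random.choices(particles, weights=w, k=len(idx))
        for i, s in zip(idx, posterior):
            particles[i] = s
        # add system noise v to all particles
        particles = [x + random.gauss(0.0, noise_sigma) for x in particles]
    return estimates

# toy usage: observations drifting from 0 to 2
est = recency_weighted_filter([0.0, 0.5, 1.0, 1.5, 2.0])
print(est[-1])
```

Setting `gamma=1.0` recovers the standard importance re-sampling filter described above, since the entire predictive prior is then drawn from the current posterior.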

4 Conclusions

In many human incremental learning situations the model of the environment is inferred across multiple past states and weighted with respect to their recency [2]. Here we presented a probabilistic model of incremental learning where the current latent state, given some observation, depends on recency-weighted past states. Although our model approximates high-order dependencies, we have described a posterior distribution estimation algorithm which is itself first-order Markovian and can hence be run with fixed time and memory costs.