Adaptive Sequential MCMC for Combined State and Parameter Estimation

by   Zhanglong Cao, et al.

In the case of a linear state space model, we implement an MCMC sampler with two phases. In the learning phase, a self-tuning sampler is used to learn the parameter mean and covariance structure. In the estimation phase, the parameter mean and covariance structure informs the proposed mechanism and is also used in a delayed-acceptance algorithm. Information on the resulting state of the system is given by a Gaussian mixture. In on-line mode, the algorithm is adaptive and uses a sliding window approach to accelerate sampling speed and to maintain appropriate acceptance rates. We apply the algorithm to joined state and parameter estimation in the case of irregularly sampled GPS time series data.



There are no comments yet.


page 9

page 14

page 22

page 24

page 25

page 33

page 34

page 37


Distributed Adaptive LMF Algorithm for Sparse Parameter Estimation in Gaussian Mixture Noise

A distributed adaptive algorithm for estimation of sparse unknown parame...

On-line Bayesian parameter estimation in general non-linear state-space models: A tutorial and new results

On-line estimation plays an important role in process control and monito...

Ensemble sampler for infinite-dimensional inverse problems

We introduce a new Markov chain Monte Carlo (MCMC) sampler for infinite-...

Parameter Estimation of absolute continuous four parameter Geometric Marshall-Olkin bivariate Pareto Distribution

In this paper we formulate a four parameter absolute continuous Geometri...

The Taxicab Sampler: MCMC for Discrete Spaces with Application to Tree Models

Motivated by the problem of exploring discrete but very complex state sp...

Inference of Stochastic Disease Transmission Models Using Particle-MCMC and a Gradient Based Proposal

State-space models have been widely used to model the dynamics of commun...

Joint state-parameter estimation of a nonlinear stochastic energy balance model from sparse noisy data

While nonlinear stochastic partial differential equations arise naturall...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Data assimilation is a sequential process, by which the observations are incorporated into a numerical model describing the evolution of this system throughout the whole process. It is applied in many fields, particularly in weather forecasting and hydrology. The quality of the numerical model determines the accuracy of this system, which requires sequential combined state and parameters inferences. An enormous literature has been done on discussing pure state estimation, however, less research is talking about estimating combined state and parameters, particularly in a sequential updating way.

Sequential Monte Carlo method is well studied in the scientific literature and quite prevalent in academic research in the last decades. It allows us to specify complex, non-linear time series patterns and enables performing real-time Bayesian estimations when it is coupled with Dynamic Generalized Linear Models [58]. However, model’s parameters are unknown in real-world application and it is a limit for standard SMC. Extensions to this algorithm have been done by researchers. Kitagawa [30]

proposed a self-organizing filter and augmenting the state vector with unknown parameters. The state and parameters are estimated simultaneously by either a non-Gaussian filter or a particle filter. Liu and West

[32] proposed an improved particle filter to kill degeneracy, which is a normal issue in static parameters estimation. They are using a kernel smoothing approximation, with a correction factor to account for over-dispersion. Alternatively, Storvik [52]

proposed a new filter algorithm by assuming the posterior depends on a set of sufficient statistics, which can be updated recursively. However, this approach only applies to parameters with conjugate priors

[54]. Particle learning was first introduced in [7]. Unlike Storvik filter, it is using sufficient statistics solely to estimate parameters and promises to reduce particle impoverishment. These particle-like methods are all using more or less sampling and resampling algorithms to update particles recursively.

Jonathan proposed in [54]

an SMC algorithm by using ensemble Kalman filter framework for high dimensional space models with observations. Their approach combines information about the parameters from data at different time points in a formal way using Bayesian updating. In

[39], the authors rely on a fixed-lag length of data approximation to filtering and sequential parameter learning in a general dynamic state-space model. This approach allows for sequential parameter learning where importance sampling has difficulties and avoids degeneracies in particle filtering. A new adaptive Markov Chain Monte Carlo method yields a quick and flexible way for estimating posterior distribution in parameter estimation [23]. This new Adaptive Proposal method depends on historical data, is introduced to avoid the difficulties of tunning the proposal distribution in Metropolis-Hastings methods.

In this chapter, I’m proposing an adaptive Delayed-Acceptance Metropolis-Hastings algorithm to estimate the posterior distribution for combined state and parameters with two phases. In the learning phase, a self-tuning random walk Metropolis-Hastings sampler is used to learn the parameter mean and covariance structure. In the estimation phase, the parameter mean and covariance structure informs the proposed mechanism and is also used in a delayed-acceptance algorithm, which greatly improves sampling efficiency. Information on the resulting state of the system is given by a Gaussian mixture. To keep the algorithm a higher computing efficiency for on-line estimation, it is suggested to cut off historical data and to use a fixed length of data up to the current state, like a window sliding along time. At the end of this chapter, an application of this algorithm on irregularly sampled GPS time series data is presented.

2 Bayesian Inference on Combined State and Parameters

In a general state-space model of the following form, either the forward map in hidden states or the observation transition matrix is linear or non-linear. We are considering the model

Observation: (1)
Hidden State: (2)

where and

are linear processes with Gaussian white noises

and . This model has an initial state and a prior distribution of the parameter is known or can be estimated. Therefore, for a general Bayesian filtering problem with known static parameter , it requires computing the posterior distribution of current state at each time by marginalizing the previous state

where is the observation information up to time . However, if is unknown, one has to marginalize the posterior distribution for parameter by


The approach in equation (3) relies on the two terms : (i) a conditional posterior distribution for the states given parameters and observations; (ii) a marginal posterior distribution for parameter

. Several methods can be used in finding the second term, such as cross validation, Expectation Maximization algorithm, Gibbs sampling, Metropolis-Hastings algorithm and so on. A Monte Carlo method is popular in research area solving this problem. Monte Carlo method is an algorithm that relies on repeated random sampling to obtain numerical results. To compute an integration of

, one has to sampling as many independent as possible and numerically to find to approximate the target function. In the target function, we draw samples of and use a numerical way to calculate its posterior .

Additionally, the marginal posterior distribution for the parameter can be written in two different ways:


The above formula (4

) is a standard Bayesian inference requiring a prior distribution

. It can be used in off-line methods, in which is inferred by iterating over a fixed observation record . In contrast, formula (5) is defined in a recursive way over time depending on the previous posterior at time , which is known as on-line method. is estimated sequentially as a new observation becomes available.

Therefore, the question becomes finding an efficient way to sampling , such as Importance sampling [24] [18], Rejection sampling [8] [34], Gibbs sampling [17], Metropolis-Hastings method [37] [25] and so on.

2.1 Log-likelihood Function of Parameter Posterior

To sample , firstly we should find its distribution function by starting from the joint covariance matrix of and . With a given , suppose the joint covariance matrix is in the form of


where represents the hidden states , represents observed and is the set of all known and unknown parameters. The inverse of the covariance matrix is the procedure matrix. In fact, it is a block matrix in the form of

where is a matrix of forward map hidden states, is a matrix of observation errors up to time . The structure of the matrices, such as bandwidth, sparse density, depending on the structure of the model. Temporally, we are using and to represent the matrices and here. Then we may find the covariance matrix easily by calculating the inverse of the procedure matrix

Because of the covariance , therefore the inverse is

Given the Choleski decomposition , we have

More usefully, by given another Choleski decomposition ,


From the objective function (4), the posterior distribution of is

Then by taking natural logarithm on the posterior of and using the useful solutions in equations (7) and (8), we will have


2.2 The Forecast Distribution

From equation (5), a sequential way for estimating the forecast distribution is needed. Suppose it is


Look back to the covariance matrices of observations that we found in the previous section

where the covariance matrix of the joint distribution is

, is a identity matrix. Then, by taking its inverse, we will get

where is a matrix, is a matrix and is a matrix. Thus by taking its inverse again, we will get

So, from the above covariance matrix, we can find the mean and variance of



2.3 The Estimation Distribution

From the joint distribution (6), one can find the best estimation with a given by



is independent and identically distributed and drawn from a zero-mean normal distribution with variance


For sole , its joint distribution with is

where helps to achieve the last element in the matrix. Thus the filtering distribution of the state is

where, after simplifying, the mean and variance are


Generally, researchers would like to find the combined estimation for and at time by

Differently, from the target equation (3), the state inference containing

samples is a mixture Gaussian distribution in the following form


Suppose is found from equation (13) and (14) for each , then its mean is


and the unconditional variance of , by law of total variance, is


3 Random Walk Metropolis-Hastings algorithm

Metropolis-Hastings algorithm is an important class of Markov Chain Monte Carlo (MCMC) algorithms [50] [56] [20]. This algorithm has been used extensively in physics but was little known to others until Müller [38] and Tierney [56] expounded the value of this algorithm to statisticians. The algorithm is extremely powerful and versatile and has been included in a list of ”The Top 10 Algorithms” with the greatest influence on the development and practice of science and engineering in the 20th century [12] [36].

Given essentially a probability distribution

(the ”target distribution”), MH algorithm provides a way to generate a Markov Chain , who has the target distribution as a stationary distribution, for the uncertain parameters requiring only that this density can be calculated at . Suppose that we can evaluate for any . The transition probabilities should satisfy the detailed balance condition

which means that the transition from the current state to the new state has the same probability as that from to . In sampling method, drawing first and then drawing should have the same probability as drawing and then drawing . However, in most situations, the details balance condition is not satisfied. Therefore, we introduce a function satisfying

In this way, a tentative new state is generated from the proposal density and it is then accepted or rejected according to acceptance probability


If , then the new state is accepted. Otherwise, the new state is accepted with probability .

Here comes an issues of how to choose . The most widely used subclass of MCMC algorithms is based around the Random Walk Metropolis (RWM). The RWM updating scheme was first applied by Metropolis [37] and proceeds as follows. Given a current value of the -dimensional Markov chain , a new value is obtained by proposing a jump from the pre-specified Lebesgue density


with for all . Here governs the overall size of the proposed jump and plays a crucial role in determining the efficiency of any algorithm. In a random walk, the proposal density function can be chosen for some suitable normal distribution, and hence and cancel in the above equation (18) [45]. Therefore, to decide whether to accept the new state, we compute the quantity


If the proposed value is accepted it becomes the next current value ; otherwise the current value is left unchanged [44].

3.1 Self-tuning Metropolis-Hastings Algorithm

In this section, I am proposing a Self-tuning MH algorithm with one-variable-at-a-time Random Walk, which can tune step sizes on its own to gain the target acceptance rates, to estimate the structure of parameters in a -dimensional space. Supposing all the parameters are independent, the idea of this algorithm is that in each iteration, only one parameter is proposed and the others are kept unchanged. After sampling, take samples out of the total amount of as new sequences. In figure 1, examples of different proposing methods are compared.

(a) One-variable-at-a-time Random Walk.
(b) Independent Multi-variable-at-a-time Random Walk.
(c) Correlated Multi-variable-at-a-time Random Walk.
Figure 1: Examples of 2-Dimension Random Walk Metropolis-Hastings algorithm. Figure 0(a) is using one-variable-at-a-time proposal Random Walk. At each time, only one variable is changed and the other one stay constant. Figure 0(b) and 0(c) are using multi-variable-at-a-time Random Walk. The difference is in figure 0(b), every forward step are proposed independently, but in 0(c) are proposed according to the covariance matrix.

To gain the target acceptance rates , the step sizes for each parameter can be tuned automatically. The concept of the algorithm is if the proposal is accepted, then we have more confidence on the direction and step size that were made. In this scenario, the next movement should be further, that means the step size in the next step is bigger than ; otherwise, a conservative proposal is made with a shorter distance, which is .

Supposing and are non-negative numbers indicating the distances of a forward movement, the new step size from current is


where the logarithm guarantees the step size is positive. By taking its expectation

and simplifying to

we can find that


Thus, if the proposal is accepted, the step size is tuned to , otherwise .

The complete one-variable-at-a-time MH is illustrated in the following table:

1 Initialization: Given an arbitrary positive step size for each parameter. Set up a value for and find by using formula (22). Set up a target acceptance rate for each parameter, where .
2 Run sampling algorithm: for  from 1 to  do
3       Randomly select a parameter , propose a new one by and leave the rest unchanged.
4       Accept with probability .
5       If it is accepted, tune step size to , otherwise .
6       Set and move to step 1 until .
8 end for
Take samples out from with equal spaced index for each parameter being a new sequence.
Algorithm 1 Self-tuning Random Walk Metropolis-Hastings Algorithm.

The advantage of the algorithm (1) is that it returns a more accurate estimation for and it is more reliable to learn the structure of parameter space. However, if is in an irregular structure, the algorithm is really time-consuming and that cause a lower efficiency. To accelerate the computation, we are introducing the Delayed Acceptance Metropolis-Hastings Algorithm.

3.2 Adaptive Delayed Acceptance Metropolis-Hastings Algorithm

The DA-MH algorithm proposed in [9] is a two-stage Metropolis-Hastings algorithm in which, typically, proposed parameter values are accepted or rejected at the first stage based on a computationally cheap surrogate for the likelihood . In stage one, the quantity is found by a standard MH acceptance formula

where is a cheap estimation for and a simple form is . Once is accepted, the process goes into stage two and the acceptance probability is


where the overall acceptance probability ensures that detailed balance is satisfied with respect to ; however if a rejection occurs at stage one then the expensive evaluation of at stage two is unnecessary.

For a symmetric proposal density kernel such as is used in the random walk MH algorithm, the acceptance probability in stage one is simplified to


If the true posterior is available then the delayed-acceptance Metropolis-Hastings algorithm is obtained by substituting this for the unbiased stochastic approximation in (23) [47].

To accelerate the MH algorithm, Delayed-Acceptance MH requires a cheap approximate estimation in formula (24). Intuitively, the approximation should be efficient with respect to time and accuracy to the true posterior . A sensible option is assuming the parameter distribution at each time is following a normal distribution with mean and covariance . So the posterior density is given by

A lazy is using identity matrix, in which way all the parameters are independent. In terms of , in most of circumstances, 0 is not an idea choice. To find an optimal or suboptimal and , several algorithms have been discussed. In [54], the author is using a second-order expansion of at the mode and the mean and covariance become and respectively. The drawback of this estimation is a global optimum is not guaranteed. In [35], the author proposed a fast adaptive MCMC sampling algorithm, which is a consist of two phases. In the learning phase, they use hybrid Gibbs sampler to learn the covariance structure of the variance components. In phase two, the covariance structure is used to formulate an effective proposal distribution for a MH algorithm.

Likewise, we are suggesting that use a batch of data with length to learn the parameter space by using self-tuning random walk MH algorithm in the learning phase first. This algorithm tunes each parameter at its own optimal step size and explores the surface in different directions. When the process is done, we have a sense of Hyper-surface of and its mean and covariance can be estimated. Then we can move to the second phase: Delayed-Acceptance MH algorithm. The new is proposed from , which is in the following form


where is the Cholesky decomposition, is the tuned step size and is Gaussian white noise. This proposing method reduces the impact of drawing from a correlation space.

3.3 Efficiency of Metropolis-Hastings Algorithm

In equation (19), the jump size determines the efficiency of RWM algorithm. For a general RWM, it is intuitively clear that we can make the algorithm arbitrarily poor by making either very large or very small [44]. Assuming is extremely large, the proposal , for example, is taken a further distance from current value . Therefore, the algorithm will reject most of its proposed moves and stay where it was for a few iterations. On the other hand, if is extremely small, the algorithm will keep accepting the proposed since is always approximately be 1 because of the continuity of and [42]. Thus, RWM takes a long time to explore the posterior space and converge to its stationary distribution. So, the balance between these two extreme situations must exist. This appropriate step size is optimal, sometimes is suboptimal, the solution to gain a Markov chain. Figure (2) illustrates the performances of RWM with different step size . From these plots we may see that either too large or too small causes high correlation chains, indicating bad samples in sampling algorithm. An appropriate decorrelate samples and returns a stationary chain, which is said to be high efficiency.

(a) With a large step size
(b) With a small step size
(c) With a proper step size
Figure 2: Metropolis algorithm sampling for a single parameter with: 1(a) a large step size, 1(b) a small step size, 1(c) an appropriate step size. The upper plots show the sample chain and lower plots indicate the autocorrelation for each case.

Plenty of work has been done to determine the efficiency of Metropolis-Hastings algorithm in recent years. Gelman, Roberts, and Gilks [16] work with algorithms consisting of a single Metropolis move (not multi-variable-at-a-time), and obtain many interesting results for the -dimensional spherical multivariate normal problem with symmetric proposal distributions, including that the optimal scale is approximately times the scale of target distribution, which implies optimal acceptance rates of for and for [20]. Roberts and Rosenthal (2001) [42] evaluate scalings that are optimal (in the sense of integrated autocorrelation times) asymptotically in the number of components. They find that an acceptance rate of 0.234 is optimal in many random walk Metropolis situations, but their studies are also restricted to algorithms that consist of only a single step in each iteration, and in any case, they conclude that acceptance rates between 0.15 and 0.5 do not cost much efficiency. Other researchers [41] [3], [4], [46], [43] have been tackled for various shapes of target on choosing the optimal scale of the RWM proposal and led to the similar rule: choose the scale so that the acceptance rate is approximately 0.234. Although nearly all of the theoretical results are based upon limiting arguments in high dimension, the rule of thumb appears to be applicable even in relatively low dimensions [44].

In terms of the step size , it is pointed out that for a stochastic approximation procedure, its step size sequence should satisfy and for some . The former condition somehow ensures that any point of can eventually be reached, while the second condition ensures that the noise is contained and does not prevent convergence [1]. Sherlock, Fearnhead, and Roberts [44] tune various algorithms to attain target acceptance rates, and their Algorithm 2 tunes step sizes of univariate updates to attain the optimal efficiency of Markov chain at the acceptance rates between 0.4 and 0.45. Additionally, Graves in [22] mentioned that it is certain that one could use the actual arctangent relationship to try to choose a good : in the univariate example, if is the desired acceptance rate, then , where

is the posterior standard deviation, will be obtained. In fact, some explorations infer a linear relationship between acceptance rate and step size, which is

, and the slope of the relationship is nearly equal to the constant -1.12 independently. However, in multi-variable-at-a-time RWM, one expects that the proper interpretation of is not the posterior standard deviation but the average conditional standard deviation, which is presumably more difficult to estimate from a Metropolis algorithm. In a higher -dimensional space, or propose multi-variable-at-a-time, suppose is known or could be estimated, then can be proposed from . Thus the optimal step size is required. A concessive way of RWM in high dimension is proposing one-variable-at-a-time and treating them as one dimension space individually. In any case, however, the behavior of RWM on a multivariate normal distribution is governed by its covariance matrix , and it is better than using a fixed distribution[42].

To explore the efficiency of a MCMC process, we introduce some notions first. For an arbitrary square integrable function , Gareth, Roberts and Jeffrey [42] define its integrated autocorrelation time by

where is assumed to be distributed according to

. Because central limit theorem, the variance of the estimator

for estimating is approximately . The variance tells us the accuracy of the estimator . The smaller it is, the faster the chain converge. Therefore, they suggest that the efficiency of Markov chains can be found by comparing the reciprocal of their integrated autocorrelation time, which is

However, the disadvantage of their method is that the measurement of efficiency is highly dependent on the function . Instead, an alternative approach is using Effective Sample Size (ESS) [28] [40]. Given a Markov chain having

iterations, the ESS measures the size of i.i.d.. samples with the same standard error, which is defined in

[21] in the following form of

where is the number of samples, is lag of the first , and is the integrated autocorrelation time. Moreover, a wide support among both statisticians [19] and physicists [51] are using the following cost of an independent sample to evaluate the performance of MCMC, that is

Being inspired by their research, we now define the Efficiency in Unit Time (EffUT) and ESS in Unit Time (ESSUT) as follows:

EffUT (26)
ESSUT (27)

where represents the computation time, which is also known as running time. The computation time is the length of time, in minutes or hours, etc, required to perform a computational process. The best Markov chain with an appropriate step size should not only have a lower correlation, as illustrated in Figure (2), but also have less time-consuming. The standard efficiency and ESS do not depend on the computation time, but EffUT and ESSUT do. The best-tuned step size gains the balance between the size of effective proposed samples and cost of time.

4 Simulation Studies

In this section, we consider the model in regular and irregular spaced time difference separately. For an one dimensional state-space model, we consider the hidden state process is a stationary and ergodic Markov process and transited by . In this paper, we assume that the state of a system has an interpretation as the summary of the past one-step behavior of the system. The states are not observed directly but by another process , which is assumed depending on by the process only and independent with each other. When observed on discrete time , the model is summarized on the directed acyclic following graph

We define . If is a constant, we retrieve a standard AR(1) model process with regular spaced time steps; if is not constant, then the model becomes more complicated with irregular spaced time steps.

4.1 Simulation on Regular Time Series Data

If the time steps are even spaced, the model can be written as a simple linear model in the following

where and are i.i.d.errors occurring in processes and is a static process parameter in forward map. An initial value is known.

To get the joint distribution for and

where , we should start from the procedure matrix , which looks like

and denoted as . Its inverse is the covariance matrix


where is a diagonal matrix with elements . The covariance matrices and are easily found.

Parameters Estimation

In formula (4), the parameter posterior is estimated with observation data . By using the algorithm 1, although it may take a longer time, we will achieve a precise estimation. Similarly with section 2.1, from the objective function, the posterior distribution of is

Then by taking natural logarithm on the posterior of and using the useful solutions in equations (7) and (8), we will have


In a simple linear case, we are choosing the parameter as the author did in [33] and using dataset, setting initial . Instead of inferring and , we are estimating and in the RW-MH to avoid singular proposals. After the process, the parameters can be transformed back to original scale. Therefore, the new parameter .

Buy using algorithm (1) and aiming the optimal acceptance rate at 0.44, after 10 000 iterations we get the acceptance rates for each parameters are and , and the estimations are and respectively. Thus, we have the cheap surrogate . Keep going to the DA-MH with another 10 000 iterations, the algorithm returns the best estimation with and . In figure 3, the trace plots illustrates that the Markov Chain of is stably fluctuating around the true .

(a) Trace plot of .
(b) Trace plot of .
(c) Trace plot of .
Figure 3: Linear simulation with true parameter . By transforming to original scale, the estimation is .

Recursive Forecast Distribution

Calculating the log-posterior of parameters requires finding out the forecast distribution of . A general way is using the joint distribution of and , which is , and following the procedure in section 2.2 to work out the inverse matrix of a multivariate normal distribution. For example, one may find the inverse of the covariance matrix

Therefore, the original form of this covariance is

By denoting and post-multiplying , we will have


A recursive way of calculating and is to use the Sherman-Morrison-Woodbury formula. In the late 1940s and the 1950s, Sherman and Morrison[48], Woodbury [59], Bartlett [2] and Bodewig [5] discovered the following result. The original Sherman-Morrison-Woodbury (for short SMW) formula has been used to consider the inverse of matrices [11]. In this paper, we will consider the more generalized case.

Theorem 1.1 (Sherman-Morrison-Woodbury). Let and both be invertible, and