1 Introduction
Data assimilation is a sequential process by which observations are incorporated into a numerical model describing the evolution of a system. It is applied in many fields, particularly weather forecasting and hydrology. The quality of the numerical model determines the accuracy of the assimilation, which in turn requires sequential inference on the state and parameters jointly. An enormous literature discusses pure state estimation; however, much less research addresses combined state and parameter estimation, particularly in a sequential updating fashion.
Sequential Monte Carlo (SMC) methods are well studied in the scientific literature and have been quite prevalent in academic research over the last decades. They allow us to specify complex, nonlinear time series patterns and enable real-time Bayesian estimation when coupled with Dynamic Generalized Linear Models [58]. However, a model's parameters are unknown in real-world applications, which is a limitation of standard SMC. Researchers have developed several extensions to the algorithm. Kitagawa [30] proposed a self-organizing filter that augments the state vector with the unknown parameters; the state and parameters are then estimated simultaneously by either a non-Gaussian filter or a particle filter. Liu and West [32] proposed an improved particle filter to suppress degeneracy, a common issue in static parameter estimation, using a kernel smoothing approximation with a correction factor to account for over-dispersion. Alternatively, Storvik [52] proposed a filter algorithm that assumes the posterior depends on a set of sufficient statistics which can be updated recursively; however, this approach only applies to parameters with conjugate priors [54]. Particle learning was first introduced in [7]. Unlike the Storvik filter, it uses sufficient statistics solely to estimate the parameters and promises to reduce particle impoverishment. These particle-type methods all rely, to a greater or lesser extent, on sampling and resampling algorithms to update the particles recursively. The authors of [54] proposed an SMC algorithm using the ensemble Kalman filter framework for high-dimensional state-space models with observations; their approach combines information about the parameters from data at different time points in a formal way using Bayesian updating. In [39], the authors rely on a fixed-lag data approximation for filtering and sequential parameter learning in a general dynamic state-space model. This approach allows sequential parameter learning where importance sampling has difficulties, and it avoids degeneracies in particle filtering. A new adaptive Markov chain Monte Carlo method yields a quick and flexible way of estimating the posterior distribution in parameter estimation [23]; this Adaptive Proposal method, which depends on historical data, was introduced to avoid the difficulty of tuning the proposal distribution in Metropolis-Hastings methods.

In this chapter, I propose an adaptive Delayed-Acceptance Metropolis-Hastings algorithm with two phases to estimate the posterior distribution of the combined state and parameters. In the learning phase, a self-tuning random walk Metropolis-Hastings sampler learns the parameter mean and covariance structure. In the estimation phase, this mean and covariance structure informs the proposal mechanism and is also used in a delayed-acceptance algorithm, which greatly improves sampling efficiency. Information on the resulting state of the system is given by a Gaussian mixture. To keep the algorithm computationally efficient for online estimation, it is suggested to cut off historical data and use a fixed length of data up to the current state, like a window sliding along time. At the end of this chapter, an application of this algorithm to irregularly sampled GPS time series data is presented.
2 Bayesian Inference on Combined State and Parameters
In a general state-space model of the following form, the forward map of the hidden states and the observation operator may each be linear or nonlinear. We are considering the model

Observation: $y_t = h(x_t) + v_t$  (1)

Hidden state: $x_t = f(x_{t-1}) + w_t$  (2)

where $f$ and $h$ are linear processes with Gaussian white noises $w_t$ and $v_t$. This model has a known initial state $x_0$, and a prior distribution of the parameter $\theta$ is known or can be estimated. Therefore, for a general Bayesian filtering problem with known static parameter $\theta$, computing the posterior distribution of the current state at each time requires marginalizing over the previous state,

$p(x_t \mid y_{1:t}, \theta) \propto p(y_t \mid x_t, \theta) \int p(x_t \mid x_{t-1}, \theta)\, p(x_{t-1} \mid y_{1:t-1}, \theta)\, \mathrm{d}x_{t-1},$

where $y_{1:t}$ is the observation information up to time $t$. However, if $\theta$ is unknown, one has to marginalize the posterior distribution over the parameter by

$p(x_t \mid y_{1:t}) = \int p(x_t \mid y_{1:t}, \theta)\, p(\theta \mid y_{1:t})\, \mathrm{d}\theta.$  (3)

The approach in equation (3) relies on two terms: (i) a conditional posterior distribution for the states given the parameters and observations; and (ii) a marginal posterior distribution for the parameter
$\theta$. Several methods can be used to find the second term, such as cross-validation, the Expectation-Maximization algorithm, Gibbs sampling, the Metropolis-Hastings algorithm and so on. Monte Carlo methods are popular in the research community for solving this problem. A Monte Carlo method is an algorithm that relies on repeated random sampling to obtain numerical results. To compute an integral such as (3), one samples as many independent draws of $\theta$ as possible and approximates the target function numerically: we draw samples of $\theta$ and calculate its posterior $p(\theta \mid y_{1:t})$ numerically. Additionally, the marginal posterior distribution for the parameter can be written in two different ways:

$p(\theta \mid y_{1:t}) \propto p(y_{1:t} \mid \theta)\, p(\theta)$  (4)

$p(\theta \mid y_{1:t}) \propto p(y_t \mid y_{1:t-1}, \theta)\, p(\theta \mid y_{1:t-1})$  (5)

Formula (4) is standard Bayesian inference requiring a prior distribution $p(\theta)$. It can be used in offline methods, in which $\theta$ is inferred by iterating over a fixed observation record. In contrast, formula (5) is defined recursively over time, depending on the previous posterior at time $t-1$; this is known as an online method: $\theta$ is estimated sequentially as each new observation becomes available. Therefore, the question becomes finding an efficient way to sample $p(\theta \mid y_{1:t})$, for example by importance sampling [24] [18], rejection sampling [8] [34], Gibbs sampling [17], the Metropolis-Hastings method [37] [25] and so on.
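As a toy illustration of the Monte Carlo principle above (my own example, not tied to the chapter's model): suppose the parameter posterior were $N(1.0, 0.2^2)$ and we wanted the posterior expectation of $g(\theta) = \theta^2$. Averaging $g$ over independent posterior draws approximates the integral:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior p(theta | y_{1:t}) = N(1.0, 0.2^2); we estimate
# E[g(theta)] for g(theta) = theta^2 by averaging over independent draws.
samples = rng.normal(loc=1.0, scale=0.2, size=200_000)
mc_estimate = np.mean(samples**2)

# Exact answer for comparison: E[theta^2] = mu^2 + sigma^2 = 1.04
print(mc_estimate)  # close to 1.04
```

The Monte Carlo error shrinks at the usual $O(1/\sqrt{N})$ rate, which is why the sampling efficiency discussed later in this chapter matters so much.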
2.1 Log-likelihood Function of the Parameter Posterior
To sample $\theta$, we should first find its distribution, starting from the joint covariance matrix of the states and observations. For a given $\theta$, suppose the joint covariance matrix has the form

(6)

where $X$ represents the hidden states, $Y$ represents the observations and $\theta$ is the set of all known and unknown parameters. The inverse of the covariance matrix is the precision matrix. In fact, it is a block matrix of the form

where one block is built from the forward map of the hidden states and another from the observation errors up to time $t$. The structure of these matrices, such as their bandwidth and sparsity, depends on the structure of the model; we use temporary symbols for them here. Then we may find the covariance matrix easily by inverting the precision matrix

Because of the form of this covariance matrix, its inverse is

Given a Cholesky decomposition of one factor, we have

More usefully, given another Cholesky decomposition,
(7) 
(8) 
From the objective function (4), the posterior distribution of $\theta$ is

Then, by taking the natural logarithm of the posterior of $\theta$ and using the results in equations (7) and (8), we have
(9) 
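The exact forms of (7)-(9) depend on the model's covariance structure, but the computational pattern they rest on — evaluating a Gaussian log-density through a Cholesky factor, so that the log-determinant and the quadratic form come cheaply and stably — can be sketched as follows. This is a minimal illustration under assumed inputs, not the chapter's actual likelihood; `gaussian_loglik` and the diagonal test case are my own:

```python
import numpy as np

def gaussian_loglik(r, S):
    """Log-density of N(0, S) at the residual vector r, evaluated through
    the Cholesky factor S = L L^T: log|S| is twice the sum of log-diagonal
    entries of L, and the quadratic form is ||L^{-1} r||^2."""
    L = np.linalg.cholesky(S)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    z = np.linalg.solve(L, r)           # forward solve L z = r
    n = len(r)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + z @ z)

# Diagonal sanity check, where the answer is available in closed form.
r = np.array([1.0, -2.0])
S = np.diag([4.0, 9.0])
print(gaussian_loglik(r, S))  # approximately -3.9769
```

Working with the Cholesky factor avoids forming the explicit inverse of the covariance, which is both cheaper and numerically safer for the banded matrices that arise here.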
2.2 The Forecast Distribution
From equation (5), a sequential way of estimating the forecast distribution is needed. Suppose it is
(10) 
Looking back at the covariance matrices of the observations found in the previous section,

where the covariance matrix of the joint distribution is

an expression in which one block is an identity matrix. Then, by taking its inverse, we get a block matrix whose blocks have the corresponding dimensions. Thus, by taking its inverse again, we get

So, from the above covariance matrix, we can find that the mean and variance of the forecast distribution are

(11)
(12) 
2.3 The Estimation Distribution
From the joint distribution (6), one can find the best estimate of the states for a given $\theta$ by

Consequently,

where the error terms are independent and identically distributed, drawn from a zero-mean normal distribution. For the sole current state $x_t$, its joint distribution with the observations is

where the last row serves to pick out the last element of the matrix. Thus the filtering distribution of the state is

where, after simplification, the mean and variance are
(13)  
(14) 
Generally, researchers seek the combined estimate of the state and parameters at time $t$ by

In contrast, from the target equation (3), the state inference based on $N$ parameter samples is a Gaussian mixture of the following form:

$p(x_t \mid y_{1:t}) \approx \frac{1}{N} \sum_{i=1}^{N} p(x_t \mid y_{1:t}, \theta^{(i)}).$  (15)

Suppose the conditional mean $\mu_i$ and variance $\sigma_i^2$ are found from equations (13) and (14) for each $\theta^{(i)}$; then the mixture mean is

$\mu = \frac{1}{N} \sum_{i=1}^{N} \mu_i,$  (16)

and the unconditional variance of $x_t$, by the law of total variance, is

$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \sigma_i^2 + \frac{1}{N} \sum_{i=1}^{N} (\mu_i - \mu)^2.$  (17)
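The mixture moment formulas above can be checked numerically. In this sketch the per-sample moments are made-up numbers (equal mixture weights assumed, as in the equal-weight sum over parameter samples), and a direct Monte Carlo draw from the mixture confirms the law-of-total-variance combination:

```python
import numpy as np

rng = np.random.default_rng(1)

# Conditional filtering moments (mu_i, sigma_i^2), one pair per theta sample.
mu = np.array([0.0, 1.0, 2.0])   # conditional means
s2 = np.array([0.5, 0.4, 0.3])   # conditional variances

# Equal-weight Gaussian mixture: mean is the average of component means,
# and the variance is E[Var] + Var[E] by the law of total variance.
mix_mean = mu.mean()
mix_var = s2.mean() + ((mu - mix_mean) ** 2).mean()

# Monte Carlo check by sampling the mixture directly.
comp = rng.integers(0, len(mu), size=500_000)
draws = rng.normal(mu[comp], np.sqrt(s2[comp]))
print(mix_mean, mix_var)
```

Note that the mixture variance is strictly larger than the average conditional variance whenever the conditional means disagree: parameter uncertainty inflates state uncertainty.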
3 Random Walk Metropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm is an important class of Markov chain Monte Carlo (MCMC) algorithms [50] [56] [20]. The algorithm has been used extensively in physics but was little known to others until Müller [38] and Tierney [56] expounded its value to statisticians. It is extremely powerful and versatile and has been included in a list of "The Top 10 Algorithms" with the greatest influence on the development and practice of science and engineering in the 20th century [12] [36].
Given a probability distribution $\pi$ (the "target distribution"), the MH algorithm provides a way to generate a Markov chain $\{\theta_1, \theta_2, \ldots\}$ that has the target distribution as a stationary distribution, requiring only that this density can be calculated pointwise. Suppose that we can evaluate $\pi(\theta)$ for any $\theta$. The transition probabilities should satisfy the detailed balance condition

$\pi(\theta)\, P(\theta' \mid \theta) = \pi(\theta')\, P(\theta \mid \theta'),$

which means that the transition from the current state $\theta$ to the new state $\theta'$ has the same probability as that from $\theta'$ to $\theta$: in sampling terms, drawing $\theta$ first and then $\theta'$ should have the same probability as drawing $\theta'$ and then $\theta$. However, in most situations an arbitrary proposal does not satisfy detailed balance on its own. Therefore, we introduce an acceptance function $\alpha(\theta, \theta')$ satisfying

$\pi(\theta)\, q(\theta' \mid \theta)\, \alpha(\theta, \theta') = \pi(\theta')\, q(\theta \mid \theta')\, \alpha(\theta', \theta).$

In this way, a tentative new state $\theta'$ is generated from the proposal density $q(\theta' \mid \theta)$ and is then accepted or rejected according to the acceptance probability

$\alpha(\theta, \theta') = \min\left(1, \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)}\right).$  (18)

If the ratio inside (18) is at least 1, the new state is accepted outright; otherwise, the new state is accepted with probability $\alpha(\theta, \theta')$.
This raises the issue of how to choose $q$. The most widely used subclass of MCMC algorithms is based on the Random Walk Metropolis (RWM) scheme. The RWM updating scheme was first applied by Metropolis [37] and proceeds as follows. Given the current value of the $d$-dimensional Markov chain $\theta$, a new value $\theta' = \theta + \epsilon$ is obtained by proposing a jump $\epsilon$ from the pre-specified Lebesgue density

$r(\epsilon; \lambda) = \frac{1}{\lambda^d}\, r\!\left(\frac{\epsilon}{\lambda}\right),$  (19)

with $r(\epsilon) = r(-\epsilon)$ for all $\epsilon$. Here $\lambda > 0$ governs the overall size of the proposed jump and plays a crucial role in determining the efficiency of the algorithm. In a random walk, the proposal density can be chosen as a suitable normal distribution, and hence $q(\theta' \mid \theta)$ and $q(\theta \mid \theta')$ cancel in equation (18) [45]. Therefore, to decide whether to accept the new state, we compute the quantity

$\alpha(\theta, \theta') = \min\left(1, \frac{\pi(\theta')}{\pi(\theta)}\right).$  (20)

If the proposed value is accepted, it becomes the next current value; otherwise the current value is left unchanged [44].
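The update rule just described can be sketched in a few lines. This is a generic illustration with a standard normal stand-in for the target (an assumption of mine, not the chapter's posterior); the symmetric Gaussian jump means only the ratio in (20) is needed:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target(x):
    # Stand-in log pi: standard normal target, assumed for illustration.
    return -0.5 * x * x

def rwm(n_iter, step):
    x, accepted = 0.0, 0
    chain = np.empty(n_iter)
    for m in range(n_iter):
        prop = x + step * rng.standard_normal()   # symmetric jump, eq. (19)
        # Symmetric proposal: the q-terms cancel, leaving the ratio in (20).
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x, accepted = prop, accepted + 1
        chain[m] = x                              # a rejected move repeats x
    return chain, accepted / n_iter

chain, acc_rate = rwm(20_000, step=2.4)  # ~2.38 x target sd for d = 1
print(acc_rate)
```

Comparing the acceptance on the log scale avoids overflow when the target density spans many orders of magnitude, which is the usual practice in real samplers.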
3.1 Self-tuning Metropolis-Hastings Algorithm
In this section, I propose a self-tuning MH algorithm with one-variable-at-a-time random walk updates, which tunes its step sizes on its own to attain the target acceptance rates, in order to estimate the structure of the parameters in a $d$-dimensional space. Assuming all the parameters are independent, the idea of this algorithm is that in each iteration only one parameter is proposed anew while the others are kept unchanged. After sampling, a subset of the total samples is taken as the new sequences. In figure 1, examples of different proposing methods are compared.

To attain the target acceptance rate $\alpha^*$, the step size for each parameter can be tuned automatically. The idea is that if a proposal is accepted, we gain confidence in the direction and step size just used; the next movement should then go further, meaning the next step size is larger than the current one. Otherwise, a conservative proposal is made over a shorter distance, i.e., with a smaller step size.
Supposing $a$ and $b$ are nonnegative numbers controlling the sizes of the adjustments, the new step size $\lambda'$ obtained from the current $\lambda$ is

$\log \lambda' = \log \lambda + a \ \text{(on acceptance)}, \qquad \log \lambda' = \log \lambda - b \ \text{(on rejection)},$  (21)

where working on the logarithm guarantees that the step size remains positive. Taking the expectation of the change in $\log \lambda$ at the target acceptance rate $\alpha^*$,

$E[\log \lambda' - \log \lambda] = \alpha^* a - (1 - \alpha^*)\, b,$

and setting this drift to zero, we find that

$b = \frac{\alpha^*}{1 - \alpha^*}\, a.$  (22)

Thus, if the proposal is accepted, the step size is tuned to $\lambda e^{a}$; otherwise, to $\lambda e^{-a \alpha^* / (1 - \alpha^*)}$.
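The tuning rule (21)-(22) is a one-line update. The sketch below uses an assumed tuning constant `a = 0.05` and target rate 0.44 and verifies that the expected drift of the log step size vanishes at the target acceptance rate:

```python
import numpy as np

def tune_step(step, accepted, target=0.44, a=0.05):
    """One self-tuning update of the step size: grow by e^a on acceptance,
    shrink by e^(-a * target / (1 - target)) on rejection, so the expected
    change of log(step) is zero exactly when P(accept) equals the target."""
    b = a * target / (1.0 - target)
    return step * np.exp(a if accepted else -b)

# Sanity check: the drift of log(step) vanishes at the target rate.
a, target = 0.05, 0.44
b = a * target / (1.0 - target)
drift = target * a - (1.0 - target) * b
print(drift)  # 0.0 up to floating point
```

When the empirical acceptance rate is above the target, acceptances outnumber rejections and the step size grows; below the target it shrinks, pushing the chain back toward the desired rate.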
The complete one-variable-at-a-time MH is illustrated in the following table:

The advantage of algorithm (1) is that it returns a more accurate estimate of the parameters and learns the structure of the parameter space more reliably. However, if the posterior has an irregular structure, the algorithm is very time-consuming, which lowers its efficiency. To accelerate the computation, we introduce the Delayed Acceptance Metropolis-Hastings algorithm.
3.2 Adaptive Delayed Acceptance Metropolis-Hastings Algorithm
The DAMH algorithm proposed in [9] is a two-stage Metropolis-Hastings algorithm in which, typically, proposed parameter values are accepted or rejected at the first stage based on a computationally cheap surrogate $\hat\pi$ for the posterior $\pi$. In stage one, a screening decision is made by the standard MH acceptance formula

$\alpha_1(\theta, \theta') = \min\left(1, \frac{\hat\pi(\theta')\, q(\theta \mid \theta')}{\hat\pi(\theta)\, q(\theta' \mid \theta)}\right),$

where $\hat\pi$ is a cheap estimate of $\pi$. Once the proposal passes stage one, the process goes to stage two, with acceptance probability

$\alpha_2(\theta, \theta') = \min\left(1, \frac{\pi(\theta')\, \hat\pi(\theta)}{\pi(\theta)\, \hat\pi(\theta')}\right),$  (23)

where the overall acceptance probability $\alpha_1 \alpha_2$ ensures that detailed balance is satisfied with respect to $\pi$; if a rejection occurs at stage one, the expensive evaluation of $\pi$ at stage two is unnecessary.

For a symmetric proposal density kernel, such as is used in the random walk MH algorithm, the acceptance probability in stage one simplifies to

$\alpha_1(\theta, \theta') = \min\left(1, \frac{\hat\pi(\theta')}{\hat\pi(\theta)}\right).$  (24)

If the true posterior is available, then the delayed-acceptance Metropolis-Hastings algorithm is obtained by substituting it for the unbiased stochastic approximation in (23) [47].
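The two-stage logic of (24) followed by (23) can be sketched as follows. Both the "expensive" target and the cheap surrogate are stand-ins of mine (a standard normal and a deliberately mis-scaled Gaussian); the point is that the expensive density is only evaluated when the surrogate screen passes, while stage two restores correctness:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_pi(x):
    # "Expensive" log-target: standard normal, standing in for the posterior.
    return -0.5 * x * x

def log_pi_hat(x):
    # Cheap surrogate: a deliberately mis-scaled Gaussian approximation.
    return -0.5 * (x / 1.2) ** 2

def damh(n_iter, step=1.0):
    x, expensive_evals = 0.0, 0
    chain = np.empty(n_iter)
    for m in range(n_iter):
        prop = x + step * rng.standard_normal()
        # Stage one, eq. (24): screen using only the cheap surrogate.
        if np.log(rng.random()) < log_pi_hat(prop) - log_pi_hat(x):
            # Stage two, eq. (23): correct with the expensive target so that
            # detailed balance holds with respect to pi.
            expensive_evals += 1
            log_a2 = (log_pi(prop) - log_pi(x)) \
                     + (log_pi_hat(x) - log_pi_hat(prop))
            if np.log(rng.random()) < log_a2:
                x = prop
        chain[m] = x
    return chain, expensive_evals

chain, n_exp = damh(20_000)
print(n_exp / 20_000)  # share of iterations paying for the expensive target
```

Even with an imperfect surrogate, the chain still targets $\pi$ exactly; a poor surrogate costs efficiency (extra stage-two rejections), never correctness.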
To accelerate the MH algorithm, delayed-acceptance MH requires a cheap approximation $\hat\pi$ in formula (24). Intuitively, the approximation should be efficient in computation time and accurate with respect to the true posterior $\pi$. A sensible option is to assume that the parameter distribution at each time follows a normal distribution with mean $\mu$ and covariance $\Sigma$, so that the surrogate density is given by

A lazy choice of $\Sigma$ is the identity matrix, under which all parameters are treated as independent. In terms of $\mu$, in most circumstances 0 is not an ideal choice. Several algorithms have been discussed for finding an optimal or suboptimal $\mu$ and $\Sigma$. In [54], the author uses a second-order expansion of the log-posterior at its mode, from which the mean and covariance estimates are taken; the drawback of this estimate is that a global optimum is not guaranteed. In [35], the author proposed a fast adaptive MCMC sampling algorithm consisting of two phases: in the learning phase, a hybrid Gibbs sampler is used to learn the covariance structure of the variance components; in phase two, that covariance structure is used to formulate an effective proposal distribution for an MH algorithm.

Likewise, we suggest first using a batch of data to learn the parameter space with the self-tuning random walk MH algorithm in the learning phase. This algorithm tunes each parameter with its own optimal step size and explores the surface in different directions. When this is done, we have a sense of the hypersurface of the posterior, and its mean $\mu$ and covariance $\Sigma$ can be estimated. Then we can move to the second phase, the delayed-acceptance MH algorithm. The new $\theta'$ is proposed from the current $\theta$ in the following form:

$\theta' = \theta + \lambda\, C z,$  (25)

where $C$ is the Cholesky factor of $\Sigma$ (so $\Sigma = C C^{\mathsf{T}}$), $\lambda$ is the tuned step size and $z$ is Gaussian white noise. This proposal mechanism reduces the impact of drawing from a correlated space.
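A proposal of the form (25) can be checked empirically: jumps built from the Cholesky factor of the learned covariance reproduce that covariance, scaled by the squared step size. The covariance values and step size below are assumptions for illustration, not learned from data:

```python
import numpy as np

rng = np.random.default_rng(4)

# Parameter covariance learned in the self-tuning phase (assumed values here).
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
C = np.linalg.cholesky(Sigma)   # Sigma = C @ C.T
step = 0.5                      # tuned global step size

def propose(theta):
    z = rng.standard_normal(theta.shape)  # Gaussian white noise
    return theta + step * C @ z           # jump follows the learned correlation

# The proposed jumps have covariance step^2 * Sigma.
jumps = np.array([propose(np.zeros(2)) for _ in range(100_000)])
print(np.cov(jumps.T))  # approximately 0.25 * Sigma
```

Proposing along the learned correlation directions lets the chain take large moves along ridges of the posterior that an axis-aligned proposal would have to traverse in many small steps.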
3.3 Efficiency of the Metropolis-Hastings Algorithm
In equation (19), the jump size $\lambda$ determines the efficiency of the RWM algorithm. For a general RWM, it is intuitively clear that we can make the algorithm arbitrarily poor by making $\lambda$ either very large or very small [44]. If $\lambda$ is extremely large, the proposal is typically far from the current value, so the algorithm rejects most of its proposed moves and stays where it is for many iterations. On the other hand, if $\lambda$ is extremely small, the algorithm keeps accepting proposals, since the acceptance ratio is always approximately 1 by the continuity of $\pi$ [42]; the RWM then takes a long time to explore the posterior space and converge to its stationary distribution. A balance between these two extremes must therefore exist: an appropriate step size is the optimal, or sometimes suboptimal, solution for obtaining a good Markov chain. Figure (2) illustrates the performance of RWM with different step sizes. From these plots we can see that a $\lambda$ that is either too large or too small yields highly correlated chains, indicating poor samples; an appropriate $\lambda$ decorrelates the samples and returns a stationary chain, which is said to have high efficiency.



Plenty of work has been done in recent years to determine the efficiency of the Metropolis-Hastings algorithm. Gelman, Roberts, and Gilks [16] work with algorithms consisting of a single Metropolis move (not multivariable-at-a-time) and obtain many interesting results for the $d$-dimensional spherical multivariate normal problem with symmetric proposal distributions, including that the optimal scale is approximately $2.38/\sqrt{d}$ times the scale of the target distribution, which implies optimal acceptance rates of about 0.44 for $d = 1$ and 0.23 as $d \to \infty$ [20]. Roberts and Rosenthal (2001) [42] evaluate scalings that are optimal (in the sense of integrated autocorrelation times) asymptotically in the number of components. They find that an acceptance rate of 0.234 is optimal in many random walk Metropolis situations, although their studies are also restricted to algorithms consisting of only a single step in each iteration; in any case, they conclude that acceptance rates between 0.15 and 0.5 do not cost much efficiency. Other researchers [41] [3] [4] [46] [43] have tackled the choice of the optimal scale of the RWM proposal for various shapes of target and arrived at a similar rule: choose the scale so that the acceptance rate is approximately 0.234. Although nearly all of the theoretical results are based on limiting arguments in high dimension, the rule of thumb appears to be applicable even in relatively low dimensions [44].
In terms of the step size, it has been pointed out that for a stochastic approximation procedure the step size sequence $\{\gamma_n\}$ should satisfy $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^{1+\delta} < \infty$ for some $\delta > 0$. The former condition ensures that any point of the space can eventually be reached, while the second ensures that the noise is contained and does not prevent convergence [1]. Sherlock, Fearnhead, and Roberts [44] tune various algorithms to attain target acceptance rates, and their Algorithm 2 tunes the step sizes of univariate updates to attain the optimal efficiency of the Markov chain at acceptance rates between 0.4 and 0.45. Additionally, Graves [22] mentions that one could use the actual arctangent relationship between acceptance rate and step size to choose a good step size: in the univariate example, a desired acceptance rate determines the step size in terms of the posterior standard deviation. In fact, some explorations suggest an approximately linear relationship between acceptance rate and step size on suitable scales, with slope nearly equal to the constant 1.12, independently of the problem. However, in multivariable-at-a-time RWM, one expects that the proper interpretation of the scale is not the posterior standard deviation but the average conditional standard deviation, which is presumably more difficult to estimate from a Metropolis algorithm. In a higher-dimensional space, or when proposing multivariable-at-a-time, suppose the covariance $\Sigma$ is known or can be estimated; then the new value can be proposed from a normal distribution with covariance proportional to $\Sigma$, and the optimal step size is required. A concessive way of running RWM in high dimension is to propose one variable at a time and treat each coordinate as a one-dimensional space individually. In any case, the behavior of RWM on a multivariate normal distribution is governed by its covariance matrix $\Sigma$, and using it is better than using a fixed proposal distribution [42].

To explore the efficiency of an MCMC process, we first introduce some notions. For an arbitrary square-integrable function $g$, Roberts and Rosenthal [42] define its integrated autocorrelation time by

$\tau_g = 1 + 2 \sum_{k=1}^{\infty} \operatorname{corr}\big(g(\theta_0), g(\theta_k)\big),$

where $\theta_0$ is assumed to be distributed according to the stationary distribution $\pi$. By the central limit theorem, the variance of the estimator $\bar g = \frac{1}{N} \sum_{i=1}^{N} g(\theta_i)$ for estimating $E_\pi[g]$ is approximately $\tau_g \operatorname{Var}_\pi(g)/N$. This variance tells us the accuracy of the estimator $\bar g$: the smaller it is, the faster the chain converges. Therefore, they suggest that the efficiency of Markov chains can be compared through the reciprocals of their integrated autocorrelation times, that is, $e_g = 1/\tau_g$. However, the disadvantage of this method is that the measurement of efficiency depends strongly on the chosen function $g$. An alternative approach is to use the Effective Sample Size (ESS) [28] [40]. Given a Markov chain with $N$ iterations, the ESS measures the number of i.i.d. samples with the same standard error, and is defined in [21] in the following form:

$\mathrm{ESS} = \frac{N}{1 + 2 \sum_{k=1}^{K} \rho_k} = \frac{N}{\tau},$

where $N$ is the number of samples, $\rho_k$ is the lag-$k$ autocorrelation, $K$ is the lag at which the autocorrelations become negligible, and $\tau$ is the integrated autocorrelation time. Moreover, there is wide support among both statisticians [19] and physicists [51] for using the cost of an independent sample, that is, the computing effort per effective sample, to evaluate the performance of MCMC.
Inspired by their research, we now define the Efficiency in Unit Time (EffUT) and the ESS in Unit Time (ESSUT) as follows:

$\mathrm{EffUT} = \frac{e_g}{T},$  (26)

$\mathrm{ESSUT} = \frac{\mathrm{ESS}}{T},$  (27)

where $T$ represents the computation time, also known as the running time: the length of time, in minutes or hours, etc., required to perform the computational process. The best Markov chain with an appropriate step size should not only have low correlation, as illustrated in Figure (2), but also consume less time. The standard efficiency and ESS do not depend on the computation time, but EffUT and ESSUT do. The best-tuned step size balances the number of effective samples against the cost of time.
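A simple ESS estimator makes the contrast concrete. The truncation rule below (stop the autocorrelation sum at the first non-positive term) is one common convention of my choosing; published estimators differ in exactly this detail:

```python
import numpy as np

def ess(chain):
    """Effective sample size N / (1 + 2 * sum of autocorrelations), with the
    sum truncated at the first non-positive sample autocorrelation."""
    n = len(chain)
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
    tau = 1.0
    for rho in acf[1:]:
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau

rng = np.random.default_rng(5)

# i.i.d. draws: ESS should be close to the nominal sample size.
iid = rng.standard_normal(10_000)

# A sticky AR(1) chain: ESS should be far smaller than the chain length.
ar = np.empty(10_000)
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.95 * ar[t - 1] + rng.standard_normal()

print(ess(iid), ess(ar))
# Dividing either value by the wall-clock time gives the ESSUT of eq. (27).
```

Two chains of equal length can thus differ in effective information by an order of magnitude or more, which is exactly what the per-unit-time measures above are designed to expose.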
4 Simulation Studies
In this section, we consider the model with regularly and irregularly spaced time steps separately. For a one-dimensional state-space model, we consider a hidden state process that is a stationary and ergodic Markov process evolved by the forward map. In this chapter, we assume that the state of the system has an interpretation as a summary of the past one-step behavior of the system. The states are not observed directly but through another process, which is assumed to depend only on the current state, with observations independent of each other. When observed at discrete times, the model is summarized in the following directed acyclic graph.

We define the time step $\Delta_i = t_i - t_{i-1}$. If $\Delta_i$ is constant, we retrieve a standard AR(1) model with regularly spaced time steps; if $\Delta_i$ is not constant, the model becomes more complicated, with irregularly spaced time steps.
4.1 Simulation on Regular Time Series Data
If the time steps are evenly spaced, the model can be written as the simple linear model

$x_t = a\, x_{t-1} + w_t, \qquad y_t = x_t + v_t,$

where $w_t$ and $v_t$ are i.i.d. errors occurring in the two processes and $a$ is a static process parameter in the forward map. An initial value $x_0$ is known.

To get the joint distribution of the states $X$ and the observations $Y$, we should start from the precision matrix, which has a simple banded structure, and denote it accordingly. Its inverse is the covariance matrix

(28)

where one factor is a diagonal matrix. The covariance matrices of the states and the observations are then easily found.
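The banded-precision claim can be verified numerically on a closely related special case. The chapter conditions on a known $x_0$; the sketch below instead assumes a stationary initialization, for which both the tridiagonal precision matrix and its inverse have clean closed forms:

```python
import numpy as np

# Stationary AR(1): x_t = a x_{t-1} + w_t with w_t ~ N(0, s2w).
a, s2w, n = 0.8, 1.0, 6

# Tridiagonal precision matrix of (x_1, ..., x_n) under stationarity:
# diagonal (1, 1 + a^2, ..., 1 + a^2, 1) / s2w, off-diagonal -a / s2w.
P = np.diag(np.full(n, (1.0 + a * a) / s2w))
P[0, 0] = P[-1, -1] = 1.0 / s2w
for i in range(n - 1):
    P[i, i + 1] = P[i + 1, i] = -a / s2w

# Its inverse is the familiar AR(1) covariance
# Cov(x_i, x_j) = s2w / (1 - a^2) * a^{|i - j|}.
Sigma = np.linalg.inv(P)
idx = np.arange(n)
Sigma_exact = s2w / (1.0 - a * a) * a ** np.abs(idx[:, None] - idx[None, :])
print(np.max(np.abs(Sigma - Sigma_exact)))  # numerically zero
```

The sparsity of the precision matrix, in contrast to the dense covariance, is what makes the Cholesky-based likelihood evaluations of section 2.1 cheap.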
Parameter Estimation

In formula (4), the parameter posterior is estimated from the observation data. By using algorithm 1, although it may take a longer time, we achieve a precise estimate. As in section 2.1, from the objective function, the posterior distribution of $\theta$ is

Then, by taking the natural logarithm of the posterior of $\theta$ and using the results in equations (7) and (8), we have

(29)

In the simple linear case, we choose the parameters as the author did in [33], using the same dataset and initial value. Instead of inferring the noise parameters directly, we estimate transformed versions of them in the RWMH to avoid singular proposals; after the process, the parameters can be transformed back to the original scale.

By using algorithm (1) and aiming at the optimal acceptance rate of 0.44, after 10,000 iterations we obtain the acceptance rates and estimates for each parameter. This gives us the cheap surrogate. Continuing to the DAMH with another 10,000 iterations, the algorithm returns the best estimates. In figure 3, the trace plots illustrate that the Markov chain of the parameters fluctuates stably around the true values.
Recursive Forecast Distribution
Calculating the log-posterior of the parameters requires finding the forecast distribution of the next observation. A general way is to use the joint distribution of the past and current observations and to follow the procedure in section 2.2 to work out the inverse matrix of a multivariate normal distribution. For example, one may find the inverse of the covariance matrix

Therefore, the original form of this covariance is

By introducing appropriate notation and post-multiplying, we have

(30)

A recursive way of calculating the forecast mean and variance is to use the Sherman-Morrison-Woodbury formula. In the late 1940s and the 1950s, Sherman and Morrison [48], Woodbury [59], Bartlett [2] and Bodewig [5] discovered the following result. The original Sherman-Morrison-Woodbury (SMW) formula has been used to compute the inverses of matrices [11]. In this chapter, we consider the more general case.
Theorem 1.1 (Sherman-Morrison-Woodbury). Let and both be invertible, and