Bayesian inference is a popular method for estimating unknown parameters from data, largely due to its ability to quantify uncertainty in the estimation results Gelman et al. (2013)
. In the current work we consider a special class of Bayesian inference problem where data have to be collected in a sequential manner. A typical example of this type of problem is the estimation of parameters, such as the initial states or the equation coefficients, in a dynamical system from observations related to the state vector at discrete times. Such problems arise from many real-world applications, ranging from weather predictionAnnan and Hargreaves (2004) to biochemical networks Golightly and Wilkinson (2011). It should be emphasized that, unlike many data assimilation problems that seek to estimate the time-dependent states in dynamical systems, the parameters that we want to estimate here are assumed do not vary in time. To distinguish the two types of problems, we refer to the former as state estimation problems and the latter as parameter estimation. We should also note that in this work we focus on methods which use samples to represent the posterior distribution, and that approximation based methods, such as the Variational Bayes Beal (2003) and the Expectation Propagation Minka (2001) will not be discussed here. Conventional sampling methods, such as Markov Chain Monte Carlo (MCMC) simulations Gilks et al. (1995), use all the data in a single batch, are unable to take advantage of the sequential structure of the type of problems. On the other hand, sequential methods utilize the sequential structure of the problem and update the posterior whenever a new collection of data become available, which makes them particularly convenient and efficient for sequential inference problems.
A popular sequential method for parameter estimation is the Ensemble Kalman filtering (EnKF) algorithm, which was initially developed to address the dynamical state estimation problems Evensen (2009). The EnKF method was extended to estimate parameters in many practical problems, e.g., Annan and Hargreaves (2004); Annan et al. (2005), and more recently it was generically formulated as a derivative-free optimization based parameter estimation method in Iglesias et al. (2013). The EnKF method for parameter estimation was further developed and analyzed in Arnold et al. (2014); Iglesias (2016); Schillings and Stuart (2017), etc. The basic idea of the EnKF method for parameter estimation is to construct an artificial dynamical system, turning the parameters of interest into the states of the constructed dynamical system, before applying the standard EnKF procedure to estimate the states of the system. A major limitation of the EnKF method is that, just like the original version for dynamical state estimation, it can only compute a Gaussian approximation of the posterior distribution, and the approximation may result in substantial approximation error, unless the actual posterior is highly close to Gaussian. Moreover the approximation error, unlike random sampling error, can not be reduced by increasing the sample size. On the other hand, the Sequential Monte Carlo sampler (SMCS) method Del Moral et al. (2006), does not have such a limitation. The SMCS algorithm is a generalisation of the particle filter Arulampalam et al. (2002); Doucet and Johansen (2009) for dynamic state estimation, generating weighted samples from the posterior distribution. Since the SMCS algorithm was proposed in Del Moral et al. (2006), considerable improvements and extensions of the method have been proposed, such as, Fearnhead et al. (2013); Beskos et al. (2017); Heng et al. (2020); Everitt et al. (2020), and more information on the developments of the SMCS methods can be found in the recent reviews Dai et al. (2020); Chopin and Papaspiliopoulos (2020) On the other hand, we need to note that there are other parameter estimation schemes also based on particle filtering, e.g., Gilks and Berzuini (2001); Chopin (2002), and the differences and connections between SMCS and these schemes are discussed in Del Moral et al. (2006). The SMCS method makes no assumption or approximation of the posterior distribution, and can directly draw (weighted) samples from any posterior. As will be discussed later, a key issue in the implementation of SMCS is the choice of suitable forward and backward kernels, as the performance of SMCS depends critically on the choices of these kernels. As has been shown in Del Moral et al. (2006), the optimal forward and backward kernels exist in principle, but designing effective kernels for specific problems is nevertheless a highly challenging task. In dynamic state estimation problems, often the EnKF approximation is used as the proposal distribution in the particle filtering algorithm Papadakis et al. (2010); Wen et al. (2020), especially for problems in which the posteriors are modestly non-Gaussian. Building upon similar ideas, we propose in this work to construct the kernels in SMCS by using an EnKF framework. Specifically, the forward kernel is obtained directly from an EnKF approximation, and the backward kernel is derived by making a Gaussian approximation of the optimal backward kernel. With several numerical examples we illustrate that the proposed method performs competitively relative to the EnKF approach. The numerical results also demonstrate that the EnKF-SMCS algorithm performs well for highly non-Gaussian posteriors.
The remaining work is organized as follows. In Section 2 we present the generic setup of the sequential inference problems that we consider in this work. In Sections 3 and 4 we respectively review the SMCS and the EnKF methods for solving sequential inference problems. In Section 5 we present the proposed EnkF-SMCS method and in Section 6 we provide several numerical examples to illustrate the performance of the proposed method. Finally Section 7 offers some concluding remarks.
2 Problem setup
We consider a sequential inference problem formulated as follows. Suppose that we want to estimate the parameter from data which become available sequentially in time. In particular the data is related to the parameter of interest via the follow model,
where each is a mapping from to , and the observation noise . It follows that the likelihood function can be written as,
It is important to note here that the restriction that the error model has to be additive Gaussian as is in Eq. (2.1) is due to the use of EnKF. While noting that relaxing such a restriction is possible, we emphasize here that additive Gaussian assumption noise is reasonable for a wide range of practical problems. We can now write the posterior distribution in a sequential form:
where is the prior distribution of , and our goal is to draw samples from for any .
The posterior in Eq. (2.2) is essential in a data tempering formulation, and as is pointed out in Fearnhead et al. (2013); Zhou et al. (2016), such problems pose challenges for usual MCMC methods especially when the amount of data is large, as they cannot conveniently exploit the sequential structure of the problem. In what follows, we first discuss two sequential methods for this type of problems: the EnKF and the SMCS algorithms, and we then propose a scheme to combine these two methods.
3 Sequential Monte Carlo Sampler
We first give a brief introduction to the SMCS method for sampling the posterior distribution , following Del Moral et al. (2006)
. The key idea of SMCS is to construct a joint distribution, the marginal of which is equal to the target distribution . Note here that needs only to be known up to a normalization constant. One then applies the sequential importance sampling algorithm Arulampalam et al. (2002); Doucet and Johansen (2009) to draw weighted samples from , which after being marginalized over , yields samples from .
Next we describe SMCS in a recursive formulation where, given an arbitrary conditional distribution , we can construct a joint distribution of and in the form of,
such that the marginal distribution of over is . Now, given a marginal distribution and a conditional distribution , we can construct an importance sampling (IS) distribution for in the form of
It is important to note here that a key requirement of the IS distribution is that we can directly draw samples from it. We let be an ensemble drawn from , and note that the weighted ensemble follows the distribution , where the weights are computed according to
As can be seen here, once the two conditional distributions and (respectively referred to as the forward and backward kernels in the rest of the paper) are chosen, we can draw samples from Eq. (3.2) and compute the associated weights from Eq. (3.3), obtaining weighted samples from as well as its marginal . The SMCS essentially conducts this procedure in the following sequential manner:
Note here that, a resampling step is often used in SMCS algorithm to alleviate the “sample degeneracy” issue Del Moral et al. (2006).
The resampling techniques are well documented in the PF literature, e.g., Arulampalam et al. (2002); Doucet and Johansen (2009), and so are not discussed here.
As can be seen from the discussion above, to use SMCS one must choose the two kernels. In principle, optimal choices of these kernels are available. For example, it is known that once is provided, one can derive that the optimal choice of is Del Moral et al. (2006):
where the optimality is in the sense of yielding the minimal estimator variance. We also note that use of the optimal L-kernel allows the weights to be written as
Moreover, we can see here that if we can choose such that , then the weight function is always unity, which means that we now sample directly from the target distribution (the ideal case). While obtaining such an ideal is usually not possible in practice, it nevertheless provides useful guideline regarding choice of the forward kernel i.e. it should be chosen such that the resulting is close to . For example, it is proposed in Del Moral et al. (2006) to use the MCMC moves as the forward kernel. A main limitation of the MCMC kernel is that it typically requires a number of MCMC moves to propose a “good” particle, and since each MCMC move involves an evaluation of the underlying mathematical model, , the total computational cost can be high when is computationally intensive.
In this work we consider an alternative to the use of MCMC kernels. Specifically we propose to choose of the form
i.e., a Gaussian distribution with meanand variance , where is a transformation. We shall compute and (or equivalently the forward kernel ) using the EnKF method.
4 Ensemble Kalman Filter
In this section we give a brief overview of the EnKF parameter estimation method proposed in Iglesias et al. (2013), which essentially aims to compute a Gaussian approximation of in each time step . To formulate the problem in an EnKF framework, we first construct an artificial dynamical system denoted by ; at any time , we have the states where , and the dynamical model,
The data is associated to the states through , or equivalently
Now let us see how the EnKF proceeds to compute a Gaussian approximation of the posterior distribution . At time , suppose that the prior can be approximated by a Gaussian distribution with mean and covariance . It follows that the posterior distribution is also Gaussian and its mean and covariance can be obtained analytically:
where is the identity matrix and
is the so-called Kalman gain matrix.
In the EnKF method, one avoids computing the mean and the covariance directly in each step. Instead, both the prior and the posterior distributions are represented with a set of samples. Suppose that at time , we have an ensemble of particles drawn according to the posterior distribution , we can propagate the particles via the dynamical model (4.1):
for , obtaining an assemble following the prior . We can compute a Gaussian approximation, , of , where the mean and the covariance of are estimated from the samples:
Once and are obtained, we then can compute and directly from Eq. (4.2), and by design, the posterior distribution is approximated by . Moreover it can be verified that the samples
with computed by Eq. (4.3), follow the distribution . That is, are the approximate ensemble of , and consequently the associated approximately follows distribution .
Now we shall discuss how to use the EnKF scheme to construct the forward kernel for SMCS. First recall that , and the propagation model , we can derive from Eq. (4.6) that,
Note that the purpose of introducing the small noise term, , in Eq. (5.1a) is to ensure that is strictly positive definite and so is a valid Gaussian conditional distribution. In all the numerical implementations performed in this work, is set to be . According to the discussion in Section 4, we have, if is a good approximation to
That is, Eq. (5.2) provides a good forward Kernel for the SMC sampler. It should be noted here that since is a nonlinear transform, in general, we can not derive the analytical expression for and as a result, we can not use the optimal backward kernel given in Eq. (3.4). Nonetheless, we can use a sub-optimal backward kernel:
where is the Gaussian approximation of , and is an approximation of . Next we need to determine and . Here can be estimated from the ensemble :
Now recall that the issue with the optimal backward kernel is that the transform inside the forward kernel is nonlinear, and as a result can not be computed analytically. Here to obtain in Eq. (5.4) explicitly, we take
and in practice is evaluated from the particles, i.e.,
It follows that the backward kernel , given by Eq. (5.4), is also Gaussian and is given by
It follows that the resulting incremental weight function is
Now using the ingredients presented above, we summarize the EnKF-SMCS scheme in Algorithm 1.
Initialization: draw sample from distribution ;
compute the weights for and renormalize
so that ;
for to do
let for ;
evaluate and with Eq. (4.5), and compute with Eq. (4.3);
draw for with given by Eq. (5.2);
compute from Eq. (5.8);
update the weights:
resample if needed.
It is important to note that a key challenge is yet to be addressed in Algorithm 1. Namely, we can see from Eq. (5.9) that when updating the particle weight, we need to compute , which involves the evaluation of the forward model from to . This operation is required at each time step, and therefore, the total computational cost can be prohibitive if the total number of steps, , is large. We propose a method to tackle the issue, which is based on the following two observations. First, in sequential inference problems, one is only interested in the posterior distribution at the final step where all data are incorporated; second, in many practical problems the posteriors may not vary substantially in several consecutive steps. It therefore may not be necessary to exactly compute the posterior distribution at each time step and, as a result, we only need to sample the posterior distribution in a relatively small number of selected steps. Based on this idea, we propose the following scheme in each time step to reduce the computational cost: we first compute an approximate weight for each particle, and then assess that if some prescribed conditions (based on the approximate weights) are satisfied. If such conditions are satisfied, we evaluate the actual weights of the particles. To implement this scheme, we have to address the following issues:
First we need a method to compute the approximate weight, which should be much easier to compute than the exact weight. Recall that in Eq. (5.9) one has to evaluate which involves computing the forward models from all the way to , and so the computational cost is high. To reduce the computational cost, we propose the following approximate method to evaluate Eq. (5.9). Namely we first write as,
and naturally we can approximate with , yielding,
In principle can be evaluated via Eq. (5.3), however, this is still computionally expensive. Thus we make another approximation, replacing with , where is the Gaussian approximation of given by Eqs. (5.5), and as a result, we obtain
which is used to compute the approximate weights.
Second we need to prescribe the conditions for triggering the computation of the actual weights. Following Green and Maskell (2017), we use the Effective Sample Size (ESS) Doucet and Johansen (2009) (based on the approximate weights) as the main indicator for computing the actual weights. Namely if the ESS calculated with the approximate weights is smaller than a threshold value, the actual weights are computed. Moreover we also have two additional conditions that can also trigger the computation of the actual weights: 1) if the actual weights have not been computed for a given number of steps; 2) if the inference reaches the final step, i.e., .
Finally we shall discuss how to compute the actual weight . It should be noted here that the recursive formulas (3.3) can not be used here since the actual value of is not available. However, letting be the preceding step where the actual weights are computed, and it can be shown that
which is used to calculate the actual weights of the particles.
We refer to this modified scheme as EnKF-SMCS with weight refinement (EnKF-SMCS-WR), the complete procedure of which is described in Algorithm 2. Finally note that in both EnKF-SMCS algorithms, a resampling step is needed.
Initialization: draw sample from distribution ;
compute the weights for and renormalize
so that ;
for to do
let for ;
evaluate and with Eq. (4.5), and compute with Eq. (4.3);
draw for with given by Eq. (5.2);
calculate the approximate weights for :
calculate the ESS of the approximate weights ;
if ESS then
calculate the weights for :
calculate the ESS of the weights ;
if ESS then
6 Numerical examples
We shall provide three examples to demonstrate the performance of the proposed EnKF-SMCS algorithms. The first example is used to illustrate that the proposed method can perform well when the posterior is strongly non-Gaussian and the EnKF becomes highly inaccurate. The second and the third examples show that even for problems where the EnKF performs reasonably well, the EnKF-SMCS can further improve the performance. We emphasize here that in all these examples, we assume that the forward model is computationally intensive and thus the main computational cost arises from the simulation of . As a result, the main computational cost of all methods is measured by the number of forward model evaluations, which in all the methods used in this section is equal to the product of the number of steps and that of the particles.
6.1 The Bernoulli model
Our first example is the Bernoulli equation,
|which has an analytical solution,|
This model is an often-used benchmark problem for data assimilation methods as it exhibits strong non-Gaussian behavior Apte et al. (2007). Here we pose it as a sequential inference problem. Namely, suppose that we can observe the solution of the equation, , at different times for , and we aim to estimate the initial condition
from the sequentially observed data. The observation noise is assumed to follow a zero-mean Gaussian distribution with standard deviation. In this example, we take , and and, moreover, we consider two different noise levels: and . In the numerical experiments, we set the ground truth to be and the data is simulated from the model (6.1) for and , which are shown in Figs. 1. In the experiments the prior distribution for is taken to be uniform: .
We sample the posterior distribution with three methods: the EnKF method in Iglesias et al. (2013), EnKF-SMCS (Algorithm 1), and EnKF-SMCS-WR (Algorithm 2). In each method, we use particles, and the bias error, i.e., the difference between the sample mean, which is a commonly used estimator, and the ground truth is then computed at each time step. The results are shown in Figs. 2 where the left figure show the results for the small noise case () and the right figure shows those for the large noise case (). It is important to note here that, in the EnKF-SMCS-WR method, only the time steps where the actual weights are computed are shown (marked by asterisks). In other words, only the asterisk signs show the correct estimation errors and the line is merely for visual guidance. First, one can see from the figures that all the methods perform in the small noise case, which is sensible as intuitively the inference should be more accurate when the observation noise is small. More importantly, we can also see that in both cases, the EnKF results in significantly higher errors than the two SMCS methods, suggesting that EnKF performs poorly for this highly non-Gaussian example. On the other hand, we observe that the two SMCS algorithms produce largely the same results in both cases (recall that only the asterisks show the results), while EnKF-SMCS-WR only calculates the actual sample weights at 10 time steps in the small noise case and 8 in the large noise case, as is compared to 50 in the EnKF-SMCS, which suggests that the proposed EnKF-SMCS-WR algorithm can significantly reduce the computational cost associated with the weight computation.
6.2 Lorenz 63 model
Our second example is the Lorenz 63 model, a popular example used in several works on parameter estimation, such as Annan and Hargreaves (2004); Mehrkanoon et al. (2012). Specifically the model consists of three variables , and , evolving according to the differential equations
where , and are three constant parameters. In this example we take the true values of the three parameters to be , and , which we assume that we have no knowledge of. Now suppose that observations of are made at a sequence of discrete time points: for and , and we want to estimate the three parameters from these observed data. The measurement noise here is taken to be zero-mean Gaussian with variance , and the priors of the three parameters are also taken to be Gaussian with means , and variances . The data used in our numerical experiments are shown in Fig. 3.
In the numerical experiments, we conduct inference for two different cases: one is that variable is observed and the other is that is observed. In each case we draw samples from the posterior distributions with EnKF, EnKF-SMCS and EnKF-SMCS-WR, where samples are drawn with each method. We plot the estimation bias errors for the case that is observed in Fig. 4 and those for that with being observed in Fig. 5. As before, the time steps where the actual weights are computed are marked by asterisks. One can see that, in both cases, the errors in the EnKF is larger than those in the two SMCS methods, especially for parameter . Once gain, the two SMCS methods yield largely the same bias errors while EnKF-SMCS-WR employs much less computations of the actual weight: 15 time steps in the first case and 14 in the second (indicated by asterisks in Fig. 5). The example shows that even for problems where the posterior distributions are rather close to Gaussian, the SMCS can further improve the estimation accuracy.
6.3 A kinetic model of the ERK pathway
In the last example we consider the parameter estimation problems in kinetic models of biochemical networks. Estimating the kinetic parameters is an essential task in kinetic modeling of the biochemical reaction networks, including genetic regulatory networks and signal transduction pathways Quach et al. (2007). In particular we consider the kinetic model of the Extracellular signal Regulated Kinase (ERK) pathway suppressed by Raf-1 kinase inhibitor protein (RKIP) Kwang-Hyun et al. (2003); Sun et al. (2008). Here we shall omit further details of biological background of the problem and proceed directly to the mathematical formulation of the problem; readers who are interested in more application-related information may consult Kwang-Hyun et al. (2003); Sun et al. (2008).
In this problem the mathematical model that is derived based on enzyme kinetics, and is represented by a dynamical system:
where is the time, is a vector of state variables which are concentrations of metabolites, enzyme and proteins or gene expression levels, is a stoichiometric matrix that describes the biochemical transformation in a biochemical network, and is the vector of reaction rates and is usually the vector of nonlinear function of the state and input variables. Specifically, in this ERK pathway model we have
which forms a system of 11 ordinary differential equations. Moreover the rates of reactionsare Kwang-Hyun et al. (2003); Sun et al. (2008):
In this problem, we can make observations of some of the concentrations at different times, from which we estimate the 11 kinetic parameters