In data analysis, partially observed Markov process (POMP) models (also known as state space models) are powerful tools for modeling and analyzing time series in many disciplines, such as ecology, econometrics, engineering and statistics. However, inference for POMP models can be problematic because of incomplete measurements and possibly weakly identifiable parameters. Typical inference methods (e.g., maximum likelihood) that impose strong assumptions, such as linear Gaussian models, often give undesirable results when those assumptions are violated. Simulation-based inference, also called plug-and-play inference, is characterized by the dynamic model entering the statistical computations only through numerical simulation of sample paths (Bretó et al., 2009). As a result, there has been considerable interest in simulation-based inference for POMP models, and in software development for this class of inference in particular. The pomp software package (King et al., 2014) was developed to provide a general representation of a POMP model, and several algorithms implemented within pomp are applicable to arbitrary POMP models (Ionides et al., 2006; Toni et al., 2009; Andrieu et al., 2010; Wood, 2010; Ionides et al., 2015; Nguyen and Ionides, 2017). However, since pomp was designed years ago for conventional POMP models, the applicability of these inference methods to very large-scale data sets remains an open question. In particular, in the big data regime, where the convergence rate is critical, some recent efficient methods have not yet become part of pomp. Motivated by this need and by the popularity of pomp, we develop the is2 software package to provide additional efficient algorithms. Based on recent developments in the stochastic optimization literature, we focus on inference methods using advanced machine learning algorithms, which we consider a complement to pomp rather than a competing software package.
Specifically, we extend the core functionality of pomp to include particle fixed lag smoothing, which is more stable than filtering and reduces the computational expense of traditional smoothing. In addition, we provide several efficient inference methods using the full information and full likelihood of the data. The first algorithm developed to carry out such inference is the second-order iterated smoothing algorithm of Nguyen and Ionides (2017). Doucet et al. (2013) showed that sequential Monte Carlo smoothing can give estimates of the first derivative and the Hessian of the log likelihood with better asymptotic convergence rates than those used for iterated filtering in pomp. One can apply these moment approximations of Doucet et al. (2013) in a Newton-Raphson type optimization algorithm. We develop a modification of the theory of Doucet et al. (2013), giving rise to a new algorithm which empirically shows enhanced performance over other available methods on standard benchmarks. The second implemented algorithm is momentum iterated filtering, in which we accumulate a velocity vector in directions of persistent increase of the log likelihood function across iterations. This helps the algorithm achieve the same level of accuracy in fewer iterations. The third algorithm developed for simulation-based inference is accelerated iterated filtering. Unlike the original iterated filtering, the accelerated method exploits an approximation of the gradient of the log likelihood in an accelerated scheme in which the convexity and unbiasedness restrictions are relaxed. The underlying theory is described in more detail in Nguyen (2018). By avoiding the computational expense of the Hessian, this method is very promising: it is first-order, yet it attains the quadratic convergence rate of second-order approaches. The fourth algorithm included in the package is average iterated filtering, which is typically motivated by a two-time-scale optimization process. In this algorithm, the slower scale is sped up by the faster scale, resulting in a very attractive approach with simple computation but a fast convergence rate. Lastly, we implement a particle version of the iterated filtering algorithm, in which the gradient of the log likelihood, obtained as a by-product of iterated filtering, is used to improve the proposal distribution of a Langevin algorithm in the context of simulation-based inference. This algorithm enables routine full-information plug-and-play Bayesian inference for POMP models, inheriting several advantages such as shortening the burn-in, accelerating the mixing of the Markov chain at the stationary period, and simplifying tuning (Dahlin et al., 2015a).
The key contributions of this paper are three-fold. First, we provide users with a number of new efficient algorithms for simulation-based inference, demonstrating their ease of use on practical problems. Second, we show that advanced machine learning simulation-based inference algorithms are not only attractive in theory but also perform well numerically in practice: applied to both simple toy problems and a challenging scientific problem, they give impressive results. Third, we provide a general framework in which interested users can not only apply the existing methods but also develop new algorithms based on smoothing instead of filtering. Consequently, since moments are by-products of iterated smoothing, advanced methods such as proximal optimization simulation-based inference could be implemented in our framework.
The paper is organized as follows. In Section 2, we introduce background on the simulation-based inference approaches we investigate. In Section 3, we describe the fixed lag smoothing algorithm in the context of partially observed Markov models, which is then extended to different plug-and-play methodologies; we also briefly describe the advantages and disadvantages of each methodology. Section 4 presents two toy problems, showing substantial gains for our new methods over current alternatives, while Section 5 illustrates the potential of some of the efficient methods in a challenging inference problem: fitting a malaria transmission model to time series data. We conclude in Section 6 with suggestions for future work. More experiments and additional functionality can be found in the package repository.
2 Background of models and simulation-based inferences
Let $\{X(t), t \in T\}$ be a Markov process, with $X(t)$ taking values in a measurable space $\mathcal{X}$ and with a finite subset $t_1 < t_2 < \cdots < t_N$ of $T$ at which $X(t)$ is observed, together with some initial time $t_0 < t_1$. We write $X_n = X(t_n)$, and hereafter for any generic sequence $\{a_n\}$ we use $a_{i:j}$ to denote $(a_i, a_{i+1}, \ldots, a_j)$. The distribution of $X_{0:N}$ is characterized by the initial density $\mu(x_0; \theta)$ and the conditional density of $X_n$ given $X_{n-1}$, written as $f(x_n \mid x_{n-1}; \theta)$ for $n = 1, \ldots, N$, where $\theta$ is an unknown parameter in $\mathbb{R}^d$. The process $X_{0:N}$ is only observed through another process $Y_{1:N}$, taking values in a measurable space $\mathcal{Y}$. The observations are assumed to be conditionally independent given $X_{0:N}$, with probability density of the form $g(y_n \mid x_n; \theta)$ for $n = 1, \ldots, N$. It is assumed that $X_{0:N}$ and $Y_{1:N}$ have a joint density $p(x_{0:N}, y_{1:N}; \theta)$. The data are a sequence of observations $y_{1:N}^*$, considered fixed, and we write the likelihood function of the data for the POMP model as
$$L(\theta) = p(y_{1:N}^*; \theta) = \int p(x_{0:N}, y_{1:N}^*; \theta) \, dx_{0:N}.$$
A practical limitation of these models is that it is difficult or even impossible to compute the log likelihood, and hence the MLE, in closed form. Maximization therefore often uses first-order stochastic approximation (Kushner and Clark, 2012), which involves a Monte Carlo approximation to a difference equation,
$$\theta_m = \theta_{m-1} + \gamma_m \nabla \ell(\theta_{m-1}),$$
where $\theta_0$ is an arbitrary initial estimate and $\{\gamma_m\}$ is a sequence of step sizes with $\sum_m \gamma_m = \infty$ and $\sum_m \gamma_m^2 < \infty$. The algorithm converges to a local maximum of $\ell(\theta) = \log L(\theta)$ under regularity conditions. The term $\nabla \ell(\theta)$, also called the score function, is shorthand for the $\mathbb{R}^d$-valued vector of partial derivatives of $\ell$ with respect to the components of $\theta$. However, in the plug-and-play setting, which does not require the ability to evaluate transition densities and their derivatives, these approaches are not applicable. Consider a parametric model consisting of a density $p_Y(y; \theta)$, with the log likelihood of the data $y^*$ given by $\ell(\theta) = \log p_Y(y^*; \theta)$. A stochastically perturbed model, corresponding to a pair of random variables $(\breve{\Theta}, \breve{Y})$ having a joint probability density on the product space, can then be defined by perturbing $\theta$ with random noise of scale $\tau$.
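The stochastic approximation recursion above can be sketched as follows. This is an illustrative stand-in, not part of is2: a toy quadratic log likelihood and Gaussian noise play the role of a Monte Carlo score estimate, and all names are hypothetical.

```python
import random

def noisy_score(theta, sigma=0.5):
    # Score of the toy log likelihood l(theta) = -(theta - 2)^2 / 2,
    # corrupted by noise: a stand-in for a Monte Carlo score estimate.
    return -(theta - 2.0) + random.gauss(0.0, sigma)

def stochastic_approximation(theta0, n_iters=2000):
    theta = theta0
    for m in range(1, n_iters + 1):
        gamma = 1.0 / m  # satisfies sum gamma_m = inf, sum gamma_m^2 < inf
        theta += gamma * noisy_score(theta)
    return theta

random.seed(1)
theta_hat = stochastic_approximation(theta0=-5.0)  # drifts toward the maximizer 2.0
```

Despite the noise in each score evaluation, the decaying step sizes average the noise out and the iterates settle near the maximizer.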
Under some regularity conditions, Doucet et al. (2013) showed that the score function and the observed information matrix of the original model can be approximated by the first two conditional moments of the perturbed parameter $\breve{\Theta}$ given the data, with error vanishing as the perturbation scale $\tau$ shrinks.
These approximations are useful for latent variable models, in which the log likelihood of the model is obtained by marginalizing over a latent variable $X$, as in equation (1).
In latent variable models, the expectation in equation (1) can be approximated by Monte Carlo importance sampling, as proposed by Ionides et al. (2011) and Doucet et al. (2013). In Nguyen and Ionides (2017), the POMP model is treated as a specific latent variable model, and a perturbed POMP model is defined by a construction similar to that of the perturbed latent variable model. Ionides et al. (2011) perturbed the parameters by setting $\breve{\Theta}_{0:N}$ to be a random walk starting at $\theta$, whereas Doucet et al. (2013) took $\breve{\Theta}_{0:N}$ to be independent additive white noise perturbations of $\theta$. Taking advantage of the asymptotic developments of Doucet et al. (2013), while maintaining some practical advantages of random walk perturbations for finite computations, Nguyen and Ionides (2017) showed that the score function of the extended model can be approximated as follows.
Nguyen and Ionides (2017) also presented an alternative variation on these results, which leads to a more stable Monte Carlo estimate.
These theorems are useful for the simulation-based inference approaches we investigate because we can approximate the gradient of the likelihood of the extended model to second order in the perturbation scale using either particle filtering or particle smoothing, which fits well with the inference framework of the next section.
3 Methodology for POMP models
We are mostly interested in full-information methods, either Bayesian or frequentist, that are plug-and-play. Table 1 provides a list of several inference methodologies for POMP models implemented in is2.
| | Frequentist | Bayesian |
| --- | --- | --- |
| Full information | Second-order iterated smoothing (is2, section 3.2); Momentum iterated filtering (Momentum-Mif, section 3.3); Accelerated iterated filtering (aif, section 3.4); Average iterated filtering (avif, section 3.5) | Particle iterated filtering (pmif, section 3.6) |
3.1 The likelihood function and particle smoothing
The structure of a POMP model implies that its log likelihood can be expressed as a sum of conditional log likelihoods, $\ell(\theta) = \sum_{n=1}^{N} \log p(y_n^* \mid y_{1:n-1}^*; \theta)$. This quantity typically has no closed form and is generally approximated by Monte Carlo methods. In the primitive bootstrap particle filter, the filtering distribution at each time step $n$ is represented by $J$ weighted samples $(x_n^j, w_n^j)$, $j = 1, \ldots, J$, where the $x_n^j$ are random samples of the distribution and the weight $w_n^j$ is the density at this sample. Initially, each weight is set to $1/J$ and the $x_0^j$ are sampled from the initial distribution $\mu(x_0; \theta)$. At each time step $n$, we draw a sample $x_n^j$ from the transition density $f(x_n \mid x_{n-1}^j; \theta)$ and use the observation $y_n^*$ to compute the weights $w_n^j = g(y_n^* \mid x_n^j; \theta)$, which we then normalize to sum to 1. We then resample the particles with probabilities proportional to these weights. The process continues, and the log likelihood approximation is estimated by $\hat{\ell}(\theta) = \sum_{n=1}^{N} \log \big( \frac{1}{J} \sum_{j=1}^{J} w_n^j \big)$, using the unnormalized weights. It has been shown that as $J \to \infty$ the approximation approaches the true log likelihood (e.g., Del Moral, 2004), and $J$ is typically chosen as a fixed large number.
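As a concrete illustration of the bootstrap particle filter just described, the sketch below implements it for a toy linear Gaussian model. The model choice and all names are illustrative, not part of the is2 interface.

```python
import math, random

def bootstrap_filter(ys, J=1000, phi=0.8, sig_x=1.0, sig_y=1.0, seed=0):
    """Primitive bootstrap particle filter for the toy model
    X_n = phi * X_{n-1} + N(0, sig_x^2),  Y_n = X_n + N(0, sig_y^2),
    X_0 ~ N(0, sig_x^2).  Returns the log likelihood estimate."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sig_x) for _ in range(J)]  # draws from the initial density
    loglik = 0.0
    for y in ys:
        # propagate each particle through the transition density
        xs = [phi * x + rng.gauss(0.0, sig_x) for x in xs]
        # weight each particle by the measurement density g(y | x)
        ws = [math.exp(-0.5 * ((y - x) / sig_y) ** 2) / (sig_y * math.sqrt(2.0 * math.pi))
              for x in xs]
        loglik += math.log(sum(ws) / J)      # conditional log likelihood term
        xs = rng.choices(xs, weights=ws, k=J)  # resample proportionally to weights
    return loglik
```

Because this toy model is linear Gaussian, the estimate can be checked against the exact log likelihood from a Kalman filter, which is a useful sanity check when experimenting with particle numbers.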
The primitive bootstrap particle filter often deteriorates quickly, especially when there are outliers in the observations. This leads to the particle depletion problem, in which the distribution is represented by only a few surviving particles. For this reason, particle smoothing is often preferable to particle filtering, since the variance of the smoothing distribution is always smaller than that of the filtering distribution. However, a naive particle smoothing algorithm does not eliminate the particle depletion problem entirely and is known to be rather computationally expensive (Kitagawa, 1996; Doucet et al., 2000). Many variations of the particle smoothing algorithm have been proposed to ameliorate this problem. Some approaches may improve numerical performance in appropriate situations (Cappé et al., 2007) but typically lose the plug-and-play property, except for fixed lag smoothing. Since we are interested in the plug-and-play property, which facilitates the investigation of alternative models, we focus here on fixed lag smoothing. Specifically, rather than targeting the sequence of filtering densities, we target the sequence of smoothing distributions $p(x_n \mid y_{1:N}^*; \theta)$. It is often reasonable to assume that, for sufficiently large $L$, observations made more than $L$ time steps after time $n$ carry little additional information about the state at time $n$. This motivates an approximation in which we only use the observations up to time $n + L$ when estimating the state at time $n$. Mathematically, this is equivalent to approximating $p(x_n \mid y_{1:N}^*; \theta)$ marginally by $p(x_n \mid y_{1:n+L}^*; \theta)$ for some lag value $L$ when $n \le N - L$, and by $p(x_n \mid y_{1:N}^*; \theta)$ when $n > N - L$ (see e.g. Polson et al., 2008). This is done by representing the smoothing distribution with a set of weighted particles, each of which is a state trajectory.
In fixed lag smoothing, the smoothed sample at time $n - L$ is obtained by tracing back the lineage of each particle at time $n$: letting $a_n^j$ denote the index of the time $n - 1$ parent of the $j$-th particle at time $n$, the time $n - L$ ancestor of particle $j$ is found by applying this parent map $L$ times.
The algorithm of Kitagawa (1996) can be used to approximate the smoothing distribution for a suitable value of $L$. Choosing an optimal $L$ may be difficult, but in general, for any given lag, smoothing is often a better choice than filtering. Another advantage of fixed lag smoothing is that its computational complexity is the same as that of the particle filter. We present the fixed lag smoothing algorithm in Algorithm 1.
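The lineage tracing used by fixed lag smoothing can be sketched by recording parent indices at each resampling step. In this illustrative fragment the weighted resampling itself is replaced by uniform draws, since only the ancestor bookkeeping is the point; all names are hypothetical.

```python
import random

def fixed_lag_ancestors(n_steps, J, seed=0):
    """Record parent indices at each (stand-in) resampling step, and return
    a function that traces a particle's lineage back a given number of lags."""
    rng = random.Random(seed)
    # parents[n][j] = index at time n-1 of the parent of particle j at time n
    parents = [[rng.randrange(J) for _ in range(J)] for _ in range(n_steps)]

    def ancestor(n, j, lag):
        # index at time n - lag of the ancestor of particle j at time n
        for t in range(n, n - lag, -1):
            j = parents[t][j]
        return j

    return parents, ancestor
```

In a real filter the parent lists would come from the resampling step, and smoothing the state at time $n - L$ amounts to reading off `ancestor(n, j, L)` for each surviving particle, so no extra passes over the data are needed.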
3.2 Second-order iterated smoothing
Iterated smoothing, a variant of the iterated filtering algorithm, enables inference using simulation-based sequential Monte Carlo smoothing. The basic form of iterated smoothing, proposed by Doucet et al. (2013), was found to have favorable theoretical properties, such as being more stable than iterated filtering and producing tighter moment approximations. The innovations of Doucet et al. (2013) are perturbing the parameters by white noise and basing the method on a sequential Monte Carlo solution to the smoothing problem. However, it is an open problem whether the original iterated smoothing approach is competitive in practice. Nguyen and Ionides (2017) modified the theory developed by Doucet et al. (2013) in two key regards.
First, random walk parameter perturbations can be used in place of the white noise perturbations of the original iterated smoothing while preserving much of the theoretical support provided by Doucet et al. (2013). Perturbing in this way has several positive effects:
It inherits second-order convergence properties by using an approximation of the observed information matrix.
It combats computational expense by canceling out some computationally demanding covariance terms.
The plug-and-play property, inherited from sequential Monte Carlo smoothing, is preserved.
Second, Nguyen and Ionides (2017) modified the approach to construct an algorithm which carries out smoothing, in particular fixed lag smoothing. This Taylor-series approach remains competitive with other state-of-the-art inference approaches, both in theory and in practice.
By default, iterated smoothing carries out fixed lag smoothing with a specified number of lags and an option for choosing either random walk or white noise perturbations. A basic version of a second-order iterated smoothing algorithm using random walk noise is available via method="is2". The second-order iterated smoothing algorithm is2 of Nguyen and Ionides (2017) has shown promise in advancing the development of simulation-based inference. For all iterated smoothing methods, by analogy with simulated annealing, it has been shown that in the limit as the perturbation intensity approaches zero, the algorithm reaches the maximum likelihood solution. The perturbation is gradually reduced according to a prescribed cooling schedule in order to allow the likelihood function to converge.
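The cooling schedule can be sketched as geometric decay of the perturbation intensity. The `fraction_50` parameterization below mirrors the convention used by pomp's mif2 (the intensity shrinks to a given fraction of its initial value after 50 iterations); the function name and interface are illustrative assumptions, not the is2 API.

```python
def geometric_cooling(sigma0, fraction_50, m):
    """Perturbation standard deviation at iteration m under geometric cooling:
    decays so that by iteration 50 it equals sigma0 * fraction_50."""
    return sigma0 * fraction_50 ** (m / 50.0)
```

For example, with `fraction_50 = 0.5` the perturbation intensity halves every 50 iterations, so the parameter search is wide early on and concentrates near the MLE as iterations proceed.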
For some time series models, a few unknown initial conditions, called initial value parameters (IVPs), are estimated as parameters. As the length of the time series increases, estimates of these parameters become less accurate, since only the early time points carry information about the initial state (King et al., 2015). Therefore, we perturb such parameters only at time zero. In general, the perturbation distribution is supplied by the user, with the components $\theta_i$ of the parameter vector perturbed either independently or simultaneously. By default, the perturbation applied to the parameters in Algorithm 2 is a normal distribution, applied independently to each parameter. In addition, it is always possible to transform the parameters and choose a perturbation distribution of interest to simplify the algorithm.
This algorithm is especially useful for parameter estimation in state-space models in which the focus is on continuous space and discrete time, the latent process can be simulated, and the conditional density of the observation given the state can be evaluated point-wise up to a multiplicative constant. The parameters are often added to the state space, and this modified state-space model is inferred using sequential Monte Carlo filtering/smoothing. A number of approaches for estimating the parameters have previously been proposed in the literature (Kitagawa, 1998; Liu and West, 2001; Wan and Van Der Merwe, 2000), but second-order iterated smoothing is distinguished by an optimal theoretical convergence rate to the maximum likelihood estimate, and it is well supported in practice.
3.3 Momentum iterated filtering
The theoretical foundation of the original iterated filtering (Ionides et al., 2006) is stochastic gradient descent (SGD). It is well known that SGD easily gets stuck in ravines, where the curvature of the surface is much steeper in some dimensions than in others. It then oscillates across the slopes of the ravine and progresses only slowly toward the local optimum, slowing down the convergence rate. Motivations for applying the momentum method to iterated filtering, rather than using the original iterated filtering, include:
To accelerate the convergence rate, it accumulates a velocity vector in directions of persistent increase of the log likelihood function across iterations (Sutskever et al., 2013). Specifically, if the dimensions of the momentum term and the gradient point in the same directions, movement in these directions is accelerated. If, on the other hand, they point in opposite directions, the momentum term dampens the oscillation in these directions. It can be shown that the momentum approach requires fewer iterations than gradient descent to reach the same degree of precision (Polyak, 1964).
Instead of re-weighting the update along each eigen-direction in proportion to the inverse of the associated curvature, the momentum method adapts the learning rate by accumulating changes along the update path. For high dimensional data, explicit computation of the curvature matrix is notoriously expensive or even infeasible. Accumulating this information along the way offers a clear computational advantage, especially in the sparse data regime.
Another crucial merit of the momentum method is its capability to avoid getting trapped in suboptimal local maxima or saddle points. Dauphin et al. (2014) postulated that the difficulty arises not from local maxima but from saddle points, i.e., points where one dimension slants up and another inclines down. These saddle points are usually surrounded by a plateau of nearly constant objective value, which makes it notoriously difficult for SGD to escape, as the gradient is near zero in all dimensions. It is important to examine to what degree the momentum method escapes such saddle points, particularly in the high-dimensional setting, in which the directions of escape may be few.
While the classical convergence theory for the momentum method depends on noiseless gradient evaluations, it is possible to extend it to the stochastic setting of iterated filtering. Unfortunately, it has been conjectured that any merit in terms of asymptotic local rate of convergence is then lost (Wiegerinck et al., 1994), and LeCun et al. (2012) confirmed this in extensive experiments. Nevertheless, after a period of diminished attention, interest in applying momentum methods has revived considerably in recent years. In simulation-based inference, the momentum method improves the convergence rate of SGD by adding a short-term memory: a fraction of the update vector of the previous time step is added to the current update vector. Note that momentum-based methods outperform SGD for convex objective functions in the early, transient phase of the optimization process, where the $1/t$ term is the most influential, although both methods are equally effective during the ending phase. However, it remains an open question whether momentum-based methods yield faster rates in the non-convex setting, specifically when we consider the convergence criterion of second-order stationarity. Motivated by the performance of momentum-based methods (e.g., Ruder, 2016), we implement the momentum iterated filtering algorithm in Algorithm 3. We focus here on providing potentially interesting methodologies for simulation-based inference; their theoretical foundations will be published elsewhere.
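The velocity accumulation described above corresponds to the classical heavy-ball update. A minimal deterministic sketch follows, with an exact gradient of a toy objective standing in for the filtering-based score estimate; names and constants are illustrative.

```python
def momentum_ascent(grad, theta0, lr=0.1, beta=0.9, n_iters=200):
    """Gradient ascent with a classical momentum (heavy ball) term:
    v <- beta * v + lr * grad(theta);  theta <- theta + v."""
    theta, v = theta0, 0.0
    for _ in range(n_iters):
        v = beta * v + lr * grad(theta)  # accumulate velocity in persistent directions
        theta = theta + v
    return theta
```

On a concave quadratic the velocity builds up along the consistent ascent direction, so the iterates reach the maximizer in far fewer steps than plain gradient ascent with the same learning rate.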
3.4 Accelerate iterated filtering
Motivations for accelerated iterated filtering, rather than the naive iterated filtering, include:
The naive stochastic gradient descent method converges for a general non-convex optimization problem but it does not achieve the optimal rate of convergence, in terms of the functional optimality gap (Ghadimi and Lan, 2016).
The accelerated gradient method in Nesterov (2013) is optimal for solving convex optimization problems, but does not necessarily converge for solving non-convex optimization problems.
The modified accelerated gradient method, which can converge in both convex and non-convex optimization, assumes unbiased estimation of the gradient, which is not satisfied for most simulation-based inference.
Biased estimation of the score function from sequential Monte Carlo filtering/smoothing, together with the perturbation noise, may slow down the convergence rate of the inference, limiting the range of applicability.
Accelerated gradient approaches are typically first-order, meaning that they require only the gradient, not the Hessian, which is computationally expensive. They are therefore preferred to second-order approaches for big data models, since they achieve the same convergence rate.
When targeting item 3, one aims to find a step size scheme which both controls the added noise and does not affect the convergence rate of the accelerated gradient method. Items 2 and 4 deliberately look for momentum which increases the convergence rate; in this context the momentum is implicit. It has been shown that, for suitable selections of step sizes and momentum schemes, the accelerated gradient method converges at an optimal rate in both the convex and the non-convex case, of the same order of magnitude as for the original gradient descent method (Nesterov, 2005), as long as the perturbation noise is controlled properly. We now add a few remarks about the extension of this approach. The approach is quite general, since it requires neither convexity of the log-likelihood surface nor unbiasedness of the estimation, which is especially relevant for simulation-based estimation. It could be called an accelerated inexact gradient approach.
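For intuition about the implicit momentum, the scheme can be sketched as a Nesterov-style accelerated ascent on a deterministic surrogate. The exact gradient below stands in for the inexact, simulation-based score estimate, and the function names are illustrative only.

```python
def nesterov_ascent(grad, theta0, lr=0.1, n_iters=1000):
    """Nesterov-style accelerated ascent: take a gradient step at a
    look-ahead point, then combine iterates with momentum weights t_m."""
    theta, y, t = theta0, theta0, 1.0
    for _ in range(n_iters):
        theta_next = y + lr * grad(y)                   # gradient step at look-ahead point
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        y = theta_next + ((t - 1.0) / t_next) * (theta_next - theta)  # implicit momentum
        theta, t = theta_next, t_next
    return theta
```

The momentum coefficient $(t_m - 1)/t_{m+1}$ grows toward 1 over iterations, which is what yields the accelerated rate without ever forming a Hessian.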
Simulation-based inference with accelerated iterated filtering is implemented by aif (Algorithm 4). The function requires an exponentially decaying average of the moments to increase the convergence rate. Under this scheme, there is a substantial improvement of the aif approach of Nguyen and Ionides (2017) over the naive approach of Ionides et al. (2006). Numerical optimization with accelerated iterated filtering is implemented by aif, which offers a choice between the filtering method of Ionides et al. (2011) and the smoothing method of Nguyen and Ionides (2017).
3.5 Average iterated filtering
For the class of iterated filtering approaches, under rather general conditions, averaging over the output of each filter results in optimal convergence rates (Ruppert, 1988; Polyak and Juditsky, 1992; Kushner and Yang, 1993; Kushner, 2010). The parameter estimate corresponding to this averaging over $M$ iterations is defined as
$$\bar{\theta}_M = \frac{1}{M} \sum_{m=1}^{M} \theta_m.$$
Average iterated filtering has typically been motivated by a two-time-scale optimization process, in which the slower scale is sped up by the faster scale. For stochastic approximation, it is advisable to decrease the step size slowly to improve stability for finite samples (Spall, 2005). Common practice chooses step sizes of order $m^{-\alpha}$ with $\alpha$ close to 1. This, however, makes the estimate sequence converge rather slowly. The second time scale then steps in and improves the overall convergence rate without affecting the stability of the algorithm. The key theoretical insight behind averaging algorithms is that averaging is the simplest form of two-time-scale algorithm, and it can be implemented on-line without remembering previous estimates. No additional constraints are imposed on the algorithm, and the computation of the average is simple, making it a very attractive approach.
One interesting feature of averaging is that it can be applied to biased sequences (e.g., sequences generated by the Kiefer-Wolfowitz or SPSA algorithms (Spall, 2005)). In simulation-based inference, where the sequence is perturbed and a sequential Monte Carlo filter is used for estimation, the output of each filter is biased, which makes it especially suitable for the averaging framework. However, as shown by Dippon and Renz (1997), averaging does not in general yield an asymptotically optimal convergence rate. For example, averaging across transients will only make things worse (Spall, 2005). It is therefore recommended to average only after a warm start, i.e., over iterations $m_0 + 1, \ldots, M$ for some burn-in length $m_0$.
It is straightforward to apply averaging to any of the available estimators. We focus here on average iterated filtering, as it improves estimation relative to the naive iterated filtering. The average in Algorithm 5 is by default taken over all filtering output, but optionally one can specify when to start averaging. The average iterated filtering algorithm avif is an averaged stochastic gradient descent implementation of sequential Monte Carlo filtering (SMC) (Lindström, 2013). In the same style as iterated filtering, we assume a Gaussian random walk in parameter space. The package also supports different choices of proposal distribution.
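The warm-started average is simple to compute, and the on-line form keeps only the running mean rather than the whole history. A minimal sketch with illustrative names:

```python
def averaged_estimate(thetas, warm_start=0):
    """Polyak-Ruppert style average of per-iteration parameter estimates,
    discarding the first `warm_start` iterations (the transient)."""
    tail = thetas[warm_start:]
    return sum(tail) / len(tail)

def online_average(theta_stream, warm_start=0):
    """Same average computed on-line, without storing past estimates."""
    avg, count = 0.0, 0
    for m, theta in enumerate(theta_stream):
        if m < warm_start:
            continue  # skip the transient before averaging
        count += 1
        avg += (theta - avg) / count  # running-mean update
    return avg
```

The on-line form is what makes averaging attractive in practice: it adds one scalar update per iteration and no memory overhead, yet both forms return the same value.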
3.6 Particle iterated filtering
In Bayesian inference, we are interested in sampling from the parameter posterior distribution
$$\pi(\theta \mid y_{1:N}^*) \propto \pi(\theta) \, L(\theta),$$
where $\pi(\theta)$ denotes the prior distribution of the parameter and $L(\theta)$ denotes the likelihood function.
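Particle iterated filtering builds on the pseudo-marginal idea: an unbiased, possibly noisy likelihood estimate replaces the exact likelihood in a Metropolis-Hastings acceptance ratio. The sketch below shows the idea on a toy target; `loglik_hat` stands in for a particle filter log likelihood estimate, and all names are illustrative rather than the is2 interface.

```python
import math, random

def pseudo_marginal_mh(loglik_hat, log_prior, theta0, n_iters=5000, step=1.0, seed=0):
    """Random walk Metropolis-Hastings in which loglik_hat(theta) may be the
    log of a noisy, unbiased (particle filter) likelihood estimate; the
    current estimate is stored and reused, which keeps the chain valid."""
    rng = random.Random(seed)
    theta, ll = theta0, loglik_hat(theta0)
    chain = []
    for _ in range(n_iters):
        prop = theta + rng.gauss(0.0, step)       # random walk proposal
        ll_prop = loglik_hat(prop)
        log_alpha = (ll_prop + log_prior(prop)) - (ll + log_prior(theta))
        if rng.random() < math.exp(min(log_alpha, 0.0)):
            theta, ll = prop, ll_prop             # accept and keep the estimate
        chain.append(theta)
    return chain
```

Storing and reusing the current likelihood estimate, rather than re-estimating it at every step, is what makes the sampler target the correct posterior even though each estimate is noisy.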