We are interested in parameter inference problems where two types of unknown parameters coexist: one type varies over an indexed time domain (state parameters, or states), and the other remains fixed (static parameters, or parameters); observations are available on the same indexed time domain as the state. Such a model is called a state space model, but inference techniques primarily focus on the state, conditional on known fixed parameters.
With fixed parameters, such problems are termed smoothing, filtering, or prediction, depending on whether state inference is conducted before, cotemporal to, or after the last observation, respectively (Särkkä, 2013). Usually, interest lies in filtering: state inference corresponding to the most recent observation. The Kalman filter (Kalman, 1960) is a well-known analytic solution for filtering, but it carries strict assumptions on uncertainty, which is assumed to be Gaussian. More complex cases typically require a Monte Carlo-based approach, termed a particle filter (PF), where a vector of proposed states propagates along with the time index (Kitagawa, 1987; Gordon et al., 1993). For recent overviews of the subject see Fearnhead and Künsch (2018) and Naesseth et al. (2019).
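To make the linear-Gaussian setting concrete, the sketch below implements a one-dimensional Kalman filter for a random-walk state observed with Gaussian noise; the model and all numerical values are illustrative assumptions, not settings taken from the examples later in the paper.

```python
import numpy as np

def kalman_filter_1d(y, a, q, c, r, m0, p0):
    """Filter the linear-Gaussian model x_t = a*x_{t-1} + N(0, q),
    y_t = c*x_t + N(0, r), returning the filtered means."""
    m, p = m0, p0
    means = []
    for obs in y:
        # predict step
        m_pred = a * m
        p_pred = a * a * p + q
        # update step with Kalman gain
        k = p_pred * c / (c * c * p_pred + r)
        m = m_pred + k * (obs - c * m_pred)
        p = (1.0 - k * c) * p_pred
        means.append(m)
    return np.array(means)

# illustrative run on synthetic data
rng = np.random.default_rng(0)
T = 100
x = np.cumsum(rng.normal(0.0, 0.1, T))   # random-walk state
y = x + rng.normal(0.0, 0.5, T)          # noisy observations
filt = kalman_filter_1d(y, a=1.0, q=0.01, c=1.0, r=0.25, m0=0.0, p0=1.0)
```

The filtered means track the latent state with much lower error than the raw observations, which is the analytic counterpart of what the particle filters below achieve by simulation.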
Unknown fixed parameters add a surprising degree of complexity to the inference problem. Algorithms for combined parameter and state (CPS) inference have received less attention than filtering; see Kantas et al. (2015) and Liu and West (2001) for comprehensive overviews. A straightforward approach introduces an artificial non-degenerate transition density for the fixed parameters and proceeds with the particle filter as though these fixed parameters were states. This approach is problematic because, once the parameter is appended, the chain is no longer ergodic: if the true parameter value is not within the initial draw of parameter proposals, then the final weighted set of proposals will not contain the true value. This problem does not arise for the state proposals, as their dynamics are genuinely stochastic.
More recent techniques embed a PF for state parameters within a larger algorithm for fixed parameters. Examples include particle Markov chain Monte Carlo (PMCMC) (Drovandi et al., 2016) and sequential Monte Carlo (SMC) (Chopin et al., 2013; Drovandi and McCutchan, 2016). CPS inference techniques have many applications, including ecology (Fasiolo et al., 2016), agent-based models (Lux, 2018), genetic networks (Mariño et al., 2017), and hydrological models (Fenicia et al., 2018). We now provide some theoretical background on state space models and particle filtering.
2.1 State space models and particle filters
State space models are termed hidden Markov models in the seminal paper by Baum and Petrie (1966). It is customary to use 'state-space' notation for state parameters, where $x_t$ refers to the value of the state parameter at time $t$. We consider state space models where state transitions are Markovian, conditional on unknown fixed parameters $\theta$:
$$x_t \mid x_{t-1}, \theta \sim f(x_t \mid x_{t-1}, \theta).$$
The state process is latent, but there is an observed process $y_t$, also conditioned on $\theta$:
$$y_t \mid x_t, \theta \sim g(y_t \mid x_t, \theta),$$
termed the emission distribution. We can see that $y_t$ is conditionally independent, given $x_t$, of its history $y_{1:t-1}$. However, since $x_t$ is latent, the filtering distribution
$$p(x_t \mid y_{1:t}, \theta)$$
is estimated instead. Authors who have conducted work in this area include Müller (1991) and Fearnhead et al. (1983). More recently, Davey et al. (2015) recreate the path of flight MH370 from satellite data using a state space model estimated with a particle filter.
Sequential Monte Carlo (SMC) (Del Moral et al., 2006) generalises the PF to standard inference problems with only fixed parameters. $N_\theta$ parameter proposals $\theta^{1:N_\theta}$, with corresponding weights $w^{1:N_\theta}$, start from an initial distribution $\pi_0(\theta)$ and transition through a sequence of intermediate distributions $\pi_1(\theta), \ldots, \pi_T(\theta)$, such that the weighted particles ultimately represent a particle approximation to the posterior $\pi_T(\theta) = p(\theta \mid y)$. Examples of transition distributions include: the sequence of distributions proportional to $p(y \mid \theta)^{\gamma_t} p(\theta)$, with $0 = \gamma_0 < \cdots < \gamma_T = 1$ (Neal, 2001); sequential incorporation of data slices into the posterior, $\pi_t(\theta) = p(\theta \mid y_{1:t})$ (Chopin, 2002), where $T$ is the number of slices; and filtering distributions $p(x_t \mid y_{1:t})$ (Doucet and Johansen, 2009), where $T$ is the time of the final observation. This last example provides the link between SMC and PFs.
A variety of methods are available to transition particles between distributions; an intuitive approach is importance sampling, where the particle approximation to $p(x_{t-1} \mid y_{1:t-1}, \theta)$ is combined with the transition distribution $f(x_t \mid x_{t-1}, \theta)$ to target $p(x_t \mid y_{1:t}, \theta)$, with a proposal distribution proportional to $p(x_{t-1} \mid y_{1:t-1}, \theta)\, f(x_t \mid x_{t-1}, \theta)$ (Gordon et al., 1993). Such a particle filter, based on importance sampling, is termed a Bayesian bootstrap particle filter (Algorithm 1). This filter is not necessarily ideal for targeting $p(x_t \mid y_{1:t}, \theta)$. Other approaches to particle filtering in the literature include a multi-step MCMC kernel (Drovandi and Pettitt, 2011) and the alive particle filter (Jasra et al., 2013).
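As a concrete illustration, the sketch below runs a Bayesian bootstrap particle filter on an assumed linear-Gaussian toy model ($x_t = 0.9\,x_{t-1} + v_t$, $y_t = x_t + e_t$); the model, particle count, and the choice to resample at every step are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data from the assumed toy model
T = 50
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = 0.9 * x_true[t - 1] + rng.normal()
y = x_true + rng.normal(size=T)

# Bootstrap particle filter: propose from the transition density,
# weight by the emission density, resample every step.
N = 1000
particles = rng.normal(size=N)
filt_mean = np.zeros(T)
for t in range(T):
    particles = 0.9 * particles + rng.normal(size=N)   # propagate
    logw = -0.5 * (y[t] - particles) ** 2              # Gaussian emission, up to a constant
    w = np.exp(logw - logw.max())
    w /= w.sum()
    filt_mean[t] = np.sum(w * particles)               # filtered posterior mean
    idx = rng.choice(N, size=N, p=w)                   # multinomial resampling
    particles = particles[idx]
```

Because the proposal is the transition density, the weight reduces to the emission density evaluated at each particle, which is the defining feature of the bootstrap filter.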
2.2 Combined parameter and state inference
We now consider CPS inference, which targets the much more complex distribution $p(x_{0:t}, \theta \mid y_{1:t})$. We are interested in drawing samples from the joint distribution of $x_t$ and $\theta$, conditional on observations $y_{1:t}$. Initial approaches addressed this problem by appending $\theta$ to the state vector $x_t$, as though the fixed parameters were dynamic. This approach leads to error which is highly model specific and poorly understood, and it led to the development of approaches that explicitly take the fixed nature of $\theta$ into account. Drovandi et al. (2016) and Drovandi and McCutchan (2016) address this problem by incorporating the alive particle filter into the PMCMC and SMC algorithms, respectively. Chopin et al. (2013) incorporate the Bayesian bootstrap filter (Algorithm 1) into an SMC approach, which they term SMC$^2$ (Algorithm 2), since this particle filter is itself a form of SMC.
Algorithm 2 uses the Bayesian bootstrap filter (Algorithm 1) to estimate $p(x_t \mid y_{1:t}, \theta)$ within a larger SMC algorithm, returning weighted parameter samples $\{\theta^i, w^i\}_{i=1}^{N_\theta}$. Since the parameter weights are updated at each time step, rather than used within a resampling step, rejuvenation is necessary; otherwise $\theta^{1:N_\theta}$ will become degenerate. So-called sample impoverishment, or particle degeneracy, is a problem in sequential particle filters. Djuric et al. (2003) introduce a measure of sample impoverishment called the effective sample size (ESS), which we use in this paper. This is equal to
$$\mathrm{ESS} = \frac{\left( \sum_{i=1}^{N_\theta} w^i \right)^2}{\sum_{i=1}^{N_\theta} (w^i)^2}.$$
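The ESS can be computed directly from the (possibly unnormalised) weights; a minimal sketch:

```python
import numpy as np

def effective_sample_size(w):
    """ESS of a set of importance weights: (sum w)^2 / sum(w^2).
    Equals N for uniform weights and approaches 1 as a single
    weight dominates."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)
```

The same expression works for normalised and unnormalised weights, since the ratio is scale-invariant.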
Degeneracy is determined based on a fixed ESS threshold. Once degeneracy is detected, $\theta^{1:N_\theta}$ are rejuvenated using a weighted kernel density $\sum_i w^i K(\theta \mid \theta^i)$, where $K$ is a Markov kernel with invariant distribution $p(\theta \mid y_{1:t})$.
The Bayesian bootstrap particle filter (Algorithm 1), which forms part of the SMC algorithm (Algorithm 2), incorporates the likelihood $g(y_t \mid x_t, \theta)$. In many cases this distribution cannot be evaluated directly, but it can still be used to draw model realisations. In the next section, we discuss approximate Bayesian computation (ABC), a likelihood-free approach to parameter inference; we also review ABC adaptations of PFs and of CPS inference.
2.3 Approximate Bayesian computation
We introduce ABC initially in terms of standard statistical models without state parameters, where the data $y$ is not time-indexed. ABC describes a range of algorithms designed to sample approximately from the posterior $p(\theta \mid y)$ without evaluating the likelihood $p(y \mid \theta)$. The simplest ABC sampler is the accept-reject sampler (Tavaré et al., 1997), in which proposed parameters are sampled from the prior $p(\theta)$ and accepted if a distance between model realisations and the observations $y$ is below some threshold $\epsilon$. Other ABC samplers include SMC-ABC (Chopin, 2002; Del Moral and Murray, 2015), MCMC-ABC (Marjoram et al., 2003), and replenishment SMC-ABC (Drovandi and Pettitt, 2011).
ABC particle filters based on the Bayesian bootstrap (Algorithm 1) (Jasra et al., 2012; Calvet and Czellar, 2014; Jasra et al., 2013) substitute a likelihood approximator $\hat g_\epsilon$ for $g(y_t \mid x_t, \theta)$ within Algorithm 1. We now give an example of $\hat g_\epsilon$. Let us assume that we cannot evaluate $g(y_t \mid x_t, \theta)$, but that, knowing $x_t$ and $\theta$, we can simulate a set $y_t^{1:M}$, $y_t^m \sim g(\cdot \mid x_t, \theta)$, from this distribution. We then replace the evaluation of this density with the proportion of the $y_t^{1:M}$ that fall near the observation $y_t$ with respect to $\epsilon$, i.e.,
$$\hat g_\epsilon(y_t \mid x_t, \theta) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\left( \| y_t^m - y_t \| \le \epsilon \right),$$
which is an unbiased estimate of
$$g_\epsilon(y_t \mid x_t, \theta) = \int \mathbb{1}\left( \| u - y_t \| \le \epsilon \right) g(u \mid x_t, \theta)\, \mathrm{d}u,$$
where $\mathbb{1}(\cdot)$ is the indicator function. The indicator could be replaced by any positive-valued decreasing function of $\| y_t^m - y_t \|$, but we use the indicator for simplicity. Note that $g_\epsilon$, as a function of $y_t$, is a proper density function up to a normalising factor that depends only on $\epsilon$. Finally, since we have $N_x$ values of $x_t$ at hand in the PF algorithm at a given time $t$, we have to perform this ABC approximation $N_x$ times. We now discuss how this ABC particle filter, based on the Bayesian bootstrap, is incorporated into a likelihood-free CPS algorithm.
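A minimal sketch of this estimator follows; the Gaussian emission is an assumed toy distribution used only to have something to simulate from.

```python
import numpy as np

rng = np.random.default_rng(2)

def abc_likelihood(y_obs, simulate, eps, M, rng):
    """ABC estimate of the emission density at y_obs: the proportion
    of M simulated pseudo-observations within eps of y_obs."""
    sims = np.array([simulate(rng) for _ in range(M)])
    return np.mean(np.abs(sims - y_obs) <= eps)

# illustrative: emission N(x_t, 1) with x_t = 0 (assumed toy model)
g_hat = abc_likelihood(0.0, lambda r: r.normal(0.0, 1.0), eps=0.1, M=5000, rng=rng)
```

As $\epsilon$ grows the estimate saturates at 1, and as $\epsilon \to 0$ (with $M \to \infty$ and rescaling by the window width) it recovers the true density, which is the usual ABC bias-variance trade-off.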
Existing likelihood-free CPS inference algorithms (Drovandi and McCutchan, 2016; Drovandi et al., 2016) require that a sequence of thresholds $\epsilon_t$ be chosen before initialisation of the CPS algorithm. We develop a technique which bypasses this requirement and allows for automatic calibration.
The goal of the algorithm we develop (Algorithm 3) is to provide a sample from the joint posteriors $p(x_{0:t}, \theta \mid y_{1:t})$. However, since some distributions are intractable, we perform a likelihood-free approximation to sample from the approximate joint posteriors
$$p_\epsilon(x_{0:t}, \theta \mid y_{1:t}) = p_\epsilon(x_{0:t} \mid \theta, y_{1:t})\, p_\epsilon(\theta \mid y_{1:t}),$$
where $p_\epsilon(\theta \mid y_{1:t})$ and $p_\epsilon(x_{0:t} \mid \theta, y_{1:t})$ are marginal and conditional distributions for $\theta$ and $x_{0:t}$, respectively, from $p_\epsilon(x_{0:t}, \theta \mid y_{1:t})$, defined in the usual way. The form of this decomposition suggests a strategy for sampling from $p_\epsilon(x_{0:t}, \theta \mid y_{1:t})$. Let us assume that, at time $t$, we have a weighted sample that comprises
a weighted sample of size $N_\theta$, namely $\theta^1, \ldots, \theta^{N_\theta}$, with weights $w^1, \ldots, w^{N_\theta}$, whose weighted empirical distribution is a Monte Carlo approximation of $p_\epsilon(\theta \mid y_{1:t})$; and
for each value $\theta^i$ in that sample, a sample $x_t^{i,1}, \ldots, x_t^{i,N_x}$, with weights $W^{i,1}, \ldots, W^{i,N_x}$, whose weighted empirical distribution is a Monte Carlo approximation of $p_\epsilon(x_t \mid \theta^i, y_{1:t})$.
For each value $\theta^i$, we resample the states using their weights with the multinomial distribution $\mathcal{M}(W^{i,1}, \ldots, W^{i,N_x})$. This multinomial resampling is done independently for each current value of the parameter; hence, for each $\theta^i$, we still have an approximation of $p_\epsilon(x_t \mid \theta^i, y_{1:t})$. There are more effective choices than multinomial resampling, such as systematic resampling (Li et al., 2015), but these are outside the scope of this paper. We move the resampled states according to $f(x_{t+1} \mid x_t^{i,a(j)}, \theta^i)$ to get the proposals $x_{t+1}^{i,j}$, where $a(j)$ denotes the index from the multinomial resampling, and we draw simulations $y_{t+1}^{i,j,m}$, $m = 1, \ldots, M$, for each $x_{t+1}^{i,j}$, according to the emission distribution $g(\cdot \mid x_{t+1}^{i,j}, \theta^i)$. Finally, the unnormalised weight of $x_{t+1}^{i,j}$ is
$$W_{t+1}^{i,j} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\left( \| y_{t+1}^{i,j,m} - y_{t+1} \| \le \epsilon_{t+1} \right).$$
This allows us to estimate $p_\epsilon(y_{t+1} \mid \theta^i, y_{1:t})$ by taking an average from the particle approximation, $\hat p_\epsilon(y_{t+1} \mid \theta^i, y_{1:t}) = \frac{1}{N_x} \sum_{j=1}^{N_x} W_{t+1}^{i,j}$.
We now discuss the calibration of $\epsilon_{t+1}$. In ABC algorithms, it is usual to calibrate $\epsilon$ as a quantile $\alpha$ of the distances of the simulated values from the observation. To that aim, one can order the particles according to their distances $d^{(1)} \le d^{(2)} \le \cdots$, and then find the smallest value of $k$ such that the cumulative normalised weight
$$\sum_{l=1}^{k} w^{(l)}$$
is greater than $\alpha$, where $w^{(l)}$ denotes the normalised weight attached to the particle with the $l$th smallest distance. Note that this weighting is not strictly necessary, but was found to be a useful addition through practical experience: particle degeneracy in the weights can mean that only a small proportion of the distances correspond to $\theta^i$'s with non-negligible weights. It is therefore important to weight the distances, so that the computed $\epsilon_{t+1}$ is based on distances arising from parameters with high weight.
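A minimal sketch of this weighted-quantile calibration (the function name is ours):

```python
import numpy as np

def calibrate_eps(distances, weights, alpha):
    """Smallest distance d such that the cumulative normalised weight
    of particles with distance <= d reaches the quantile level alpha."""
    distances = np.asarray(distances, dtype=float)
    order = np.argsort(distances)
    cum = np.cumsum(np.asarray(weights, dtype=float)[order])
    cum /= cum[-1]
    k = np.searchsorted(cum, alpha)   # first index where cum >= alpha
    return distances[order][k]
```

With uniform weights this reduces to an ordinary empirical quantile; with degenerate weights it ignores distances attached to negligible-weight parameters, as intended.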
If the transition density $f(x_{t+1} \mid x_t, \theta)$ of the state space model is intractable, the proposal density can be set to $f$ so that the two densities cancel one another. This is the approach we take, in order to demonstrate the algorithm on fully intractable problems. In that case, all proposal ratios are equal to one, and the ratio (Equation (6)) is just the proportion of accepted simulations, weighted by the state weights, marginalised over parameters and states. The resulting algorithm is given in Algorithm 3.
With each ABCSMC update, more data are incorporated into the fixed-parameter posterior $p_\epsilon(\theta \mid y_{1:t})$; this leads to degeneracy of the weights $w^{1:N_\theta}$. To check for degeneracy, we use the ESS criterion in Equation 2, with a threshold chosen prior to analysis. In order to propose new parameters and further improve the particle approximation to $p_\epsilon(\theta \mid y_{1:t})$, we need an estimate of the marginal likelihood $p_\epsilon(y_{1:t} \mid \theta)$. Chopin et al. (2013) showed that this quantity can be estimated, up to a normalising constant, in the following manner:
$$\hat p_\epsilon(y_{1:t} \mid \theta^i) \propto \prod_{s=1}^{t} \left( \frac{1}{N_x} \sum_{j=1}^{N_x} W_s^{i,j} \right).$$
Once we have $\hat p_\epsilon(y_{1:t} \mid \theta^i)$ for each $\theta^i$, we can rejuvenate the particle system. The steps involved in parameter rejuvenation are the following:
resample $\theta^{1:N_\theta}$, to make the weights $w^{1:N_\theta}$ non-degenerate,
generate new parameter proposals $\tilde\theta^i$, for each $\theta^i$, from a kernel $K(\tilde\theta^i \mid \theta^i)$,
run independent ABC particle filters on the $\tilde\theta^i$, with the previously computed thresholds $\epsilon_{1:t}$, to compute the particle system at time $t$, and
accept each $\tilde\theta^i$ with probability
$$1 \wedge \frac{\hat p_\epsilon(y_{1:t} \mid \tilde\theta^i)\, p(\tilde\theta^i)\, K(\theta^i \mid \tilde\theta^i)}{\hat p_\epsilon(y_{1:t} \mid \theta^i)\, p(\theta^i)\, K(\tilde\theta^i \mid \theta^i)}.$$
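The rejuvenation steps above can be sketched as follows. Here `abc_marginal_likelihood` is a hypothetical stand-in for running an ABC particle filter over $y_{1:t}$ with the recycled thresholds; the prior is taken as flat and the kernel as symmetric, so the acceptance ratio reduces to the ratio of likelihood estimates. All of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def abc_marginal_likelihood(theta, rng):
    """Hypothetical stand-in for an ABC particle filter run: returns a
    noisy, positive, unbiased toy estimate of exp(-theta^2 / 2)."""
    return np.exp(-0.5 * theta ** 2) * rng.gamma(shape=50.0, scale=1.0 / 50.0)

def rejuvenate(thetas, zs, tau, rng):
    """One MH rejuvenation pass: propose from a Gaussian random-walk
    kernel, re-estimate the marginal likelihood, accept or reject."""
    out_t, out_z = thetas.copy(), zs.copy()
    for i in range(len(thetas)):
        prop = thetas[i] + rng.normal(0.0, tau)
        z_prop = abc_marginal_likelihood(prop, rng)
        # flat prior and symmetric kernel assumed, so ratios cancel
        if rng.uniform() < min(1.0, z_prop / zs[i]):
            out_t[i], out_z[i] = prop, z_prop
    return out_t, out_z

thetas = rng.normal(0.0, 2.0, size=500)
zs = np.array([abc_marginal_likelihood(t, rng) for t in thetas])
new_t, new_z = rejuvenate(thetas, zs, tau=0.5, rng=rng)
```

Because the likelihood estimates are noisy but unbiased, this is a pseudo-marginal move: keeping the current estimate $z$ alongside each particle is what preserves the correct invariant distribution.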
It is important to recycle the previously computed thresholds so that equivalent distributions are used in the numerator and denominator of the acceptance probability. The rejuvenation step means that the entire state space model, up to time $t$, is re-evaluated for the proposed parameters within the ABC particle filter whenever the weights become degenerate. This means that the computational complexity of our algorithm is not linear in time; this is also true for the algorithm of Chopin et al. (2013). We demonstrate the use of our self-calibrated ABCSMC update (Algorithm 3) with some examples.
4 Example: Skewed normal distribution
As a demonstration of our approach, we consider a skewed version of the normal distribution, as defined by Azzalini (1985). The skewed-normal distribution can be parameterised in a similar way to the normal distribution, with an additional parameter for skewness (see Azzalini (2005) for further details); we represent this distribution as SN$(\mu, \sigma, \lambda)$, with location $\mu$, scale $\sigma$, and skewness $\lambda$. The statistical model is described below.
Each observation comprises 10 independent and identically distributed observations of the skewed-normal distribution at each time step $t$. The state parameter $x_t$ is the location (mean) parameter of the SN distribution, and the parameters to be inferred are the scale $\sigma$ and the skewness $\lambda$. Priors are placed on $\sigma$ and $\lambda$, and the data are generated at fixed true parameter values.
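The model just described can be simulated using Azzalini's stochastic representation of the skew-normal; the random-walk state transition and all numerical values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(4)

def rvs_skewnorm(lmbda, loc, scale, size, rng):
    """Azzalini's representation: with Z0, Z1 iid N(0,1) and
    delta = lambda / sqrt(1 + lambda^2), the variable
    delta*|Z0| + sqrt(1 - delta^2)*Z1 follows SN(0, 1, lambda)."""
    delta = lmbda / np.sqrt(1.0 + lmbda ** 2)
    z0 = np.abs(rng.normal(size=size))
    z1 = rng.normal(size=size)
    return loc + scale * (delta * z0 + np.sqrt(1.0 - delta ** 2) * z1)

# Illustrative state space simulation; sigma_x, scale and lambda are
# assumptions rather than the paper's true values.
T, n_obs, sigma_x, scale, lmbda = 20, 10, 0.3, 1.0, 5.0
x = np.zeros(T)
y = np.zeros((T, n_obs))
for t in range(T):
    x[t] = (x[t - 1] if t > 0 else 0.0) + rng.normal(0.0, sigma_x)
    # 10 iid skew-normal draws per time step, location given by the state
    y[t] = rvs_skewnorm(lmbda, loc=x[t], scale=scale, size=n_obs, rng=rng)
```

Note that with positive $\lambda$ the draws are shifted above the location parameter on average, which is why the summary statistics below include a skewness term.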
The summary statistics for this example are the sample mean, the sample standard deviation, and the skewness. Skewness is calculated using a method discussed in Joanes and Gill (1998),
$$b_1 = \frac{m_3}{s^3},$$
as implemented by the R package e1071 (Meyer et al., 2019), where $m_3$ is the sample third central moment and $s$ is the sample standard deviation. The distance is the unweighted Euclidean distance between observed and simulated summary statistics. We implemented this model with values of $N_\theta$, $N_x$, $M$, and the quantile level fixed prior to analysis, for time $t = 1, \ldots, T$; the ESS threshold is likewise fixed prior to analysis. The kernel $K$ for the rejuvenation algorithm is Gaussian, with mean and variance parameters estimated from the weighted set of parameter particles.
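The skewness statistic can be computed as below; we use the $m_3 / s^3$ form ($b_1$ in Joanes and Gill's notation), which corresponds to one of the types offered by e1071, as an assumption.

```python
import numpy as np

def skewness_b1(x):
    """Sample skewness b1 = m3 / s^3, with m3 the third central moment
    and s the (n-1)-denominator sample standard deviation."""
    x = np.asarray(x, dtype=float)
    m3 = np.mean((x - x.mean()) ** 3)
    s = x.std(ddof=1)
    return m3 / s ** 3
```

Symmetric samples give a value near zero, while a long right tail gives a positive value, which is what makes this statistic informative about the SN skewness parameter.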
Inference results for the marginalised filtering distributions of $x_t$ are shown in Figure 1. We can see that the algorithm performs well in matching the true state. The posterior for the fixed parameters is shown in Figure 2; here the results are more mixed. One parameter is estimated well, with posterior mass concentrated near the true value, whereas the other displays more posterior variance.
5 Example: Hawkes process
Hawkes processes, as introduced by Hawkes (1971a, b), are a type of self-exciting point process. The term self-exciting means that events are not independent: previous events raise the probability of a new event occurring. The rate for a Hawkes process is
$$\lambda(s) = \mu(s) + \sum_{e_i < s} \kappa(s, e_i),$$
where we use $s$ rather than $t$ to emphasise that this time variable is continuous. Here $\mu(s)$ is a baseline rate function which is conditionally independent of previous events, and $\kappa$ is a kernel function evaluated over past events $e_i$ and time $s$. We now describe the specifics of the Hawkes process we intend to study.
We set $\mu(s)$ to be a step function, defined as $\mu(s) = \sum_t \operatorname{logit}^{-1}(x_t)\, \mathbb{1}_{[t, t+1)}(s)$, where $\operatorname{logit}^{-1}$ is the inverse logit function and $\mathbb{1}_A$ is the indicator function for set $A$. The coefficients $x_t$ are latent variables and form the state space for our example. Transitions between states are normally distributed, $x_t \mid x_{t-1} \sim N(x_{t-1}, \sigma^2)$. The kernel function $\kappa$ has a strength parameter $\eta$ and a shape parameter $\beta$; combining the baseline $\mu(s)$ and the kernel function gives the overall Hawkes process rate $\lambda(s)$.
The state $x_t$ controls the independent component of the point process, and the parameters $\eta$ and $\beta$ control the self-exciting component: $\eta$ controls the strength of the self-excitation and $\beta$ controls its shape.
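For intuition, the sketch below simulates a Hawkes process by Ogata's thinning algorithm, assuming a constant baseline and an exponential kernel $\eta \beta e^{-\beta (s - e_i)}$; both of these choices, and all parameter values, are illustrative assumptions rather than the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_hawkes(mu, eta, beta, t_max, rng):
    """Ogata's thinning: propose candidate times from an upper bound on
    the intensity, then accept each with probability lambda(s)/bound.
    Valid here because the intensity decays between events."""
    events, s = [], 0.0
    while True:
        # intensity at the current time bounds it until the next event
        lam_bar = mu + eta * beta * sum(np.exp(-beta * (s - e)) for e in events)
        s += rng.exponential(1.0 / lam_bar)
        if s > t_max:
            break
        lam_s = mu + eta * beta * sum(np.exp(-beta * (s - e)) for e in events)
        if rng.uniform() <= lam_s / lam_bar:
            events.append(s)
    return np.array(events)

ev = simulate_hawkes(mu=0.5, eta=0.5, beta=1.0, t_max=200.0, rng=rng)
```

With branching ratio $\eta < 1$ the process is stable, and the expected number of events is $\mu\, t_{\max} / (1 - \eta)$, i.e. about 200 here.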
The ordered (ascending) set of all events which fall within the continuous-time epoch $[t, t+1)$ is denoted $y_t$. The state space model is described below.
In this example, the distribution of $y_t$ is not conditionally independent of $y_{1:t-1}$, given $x_t$. Since $y_{1:t-1}$ is observed directly, it is used to simulate $y_t$, rather than the previously simulated realisation. Note that, in either case, it is possible that $y_t$ is the null set. This complicates definitions of summary statistics, so we append imaginary events at the epoch boundaries $t$ and $t+1$; the observations (and model realisations) are now the padded sets $y_t'$. The following summary statistics are defined on the new $y_t'$: the number of events in the interval; the sum of squares of inter-event times,
$$\sum_i \left( e_{(i+1)} - e_{(i)} \right)^2;$$
the sum of cubes of inter-event times,
$$\sum_i \left( e_{(i+1)} - e_{(i)} \right)^3;$$
and the minimum of differences,
$$\min_i \left( e_{(i+1)} - e_{(i)} \right),$$
where $e_{(i)}$ denotes the $i$th element of the ordered set $y_t'$.
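These four summaries can be computed as follows, using the boundary-padding convention described above (the function name is ours):

```python
import numpy as np

def event_summaries(events, t0, t1):
    """Summaries on the padded event set {t0} + events + {t1}:
    event count, sum of squared inter-event times, sum of cubed
    inter-event times, and the minimum inter-event gap."""
    padded = np.concatenate(([t0], np.sort(np.asarray(events, dtype=float)), [t1]))
    gaps = np.diff(padded)
    return np.array([len(events), np.sum(gaps ** 2),
                     np.sum(gaps ** 3), gaps.min()])
```

The padding guarantees at least one inter-event gap even when the epoch contains no events, so all four statistics remain well defined.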
These summary statistics were used to construct empirical estimators for the two fixed parameters. To construct the estimators, we performed linear regression adjustment on the summary statistics (Beaumont et al., 2002). The distance is the Euclidean distance on these empirical estimators. Tuning parameters used in the SMC procedure were fixed prior to analysis. The transition kernel $K$ (Algorithm 3) is Gaussian, centred at the current parameter value with covariance $h^2 \hat\Sigma$, where $\hat\Sigma$ is the maximum likelihood estimate of the covariance matrix of the weighted parameter particles and $h$ is a fixed tuning parameter, set to 0.1 to ensure high acceptance. Based on this definition of $K$, the kernel is symmetric, so the ratio of transition kernels in the acceptance probability cancels.
The synthetic dataset we consider as observed data is simulated with known parameter values, with observations simulated up to a fixed final time. The true state is shown in red in Figure 3, along with Bayesian prediction intervals for the marginalised filtering distributions. The results show very good correspondence with the true state, considering the complexity and dependencies within the Hawkes simulation (Equation 11).
The priors were the same for both parameters. The posteriors are shown in Figure 4. The precision of the state inference is not reflected in the parameter inference. The posterior variance is slightly higher for one of the parameters, which may be related to how the distance was defined with respect to the empirical estimators of the two parameters.
6 Example: Econometric model
Stochastic volatility models are among the most used state space models in the econometric literature. Recently, Vankov et al. (2019) considered the case where the emission distribution is a stable distribution, which can account for heavy-tailed volatility. Assuming the exact values of each model parameter to be known, they proposed a likelihood-free particle filter to infer the latent states. In this model, the emission noise follows a stable distribution with characteristic function
$$\varphi(u) = \exp\left\{ i \delta u - \gamma^\alpha |u|^\alpha \left( 1 - i \beta\, \operatorname{sgn}(u) \tan\frac{\pi\alpha}{2} \right) \right\}, \qquad \alpha \ne 1.$$
Note that $\alpha$ is a stability parameter, $\beta$ a skewness parameter, and $\gamma$ and $\delta$ are related to the scale and the position of the distribution, respectively. In particular, if $\alpha = 2$ and $\beta = 0$, the stable distribution is a Gaussian distribution; if $\alpha = 1$ and $\beta = 0$, it is a Cauchy distribution; and if $\alpha = 1/2$ and $\beta = 1$, it is a Lévy distribution. Apart from these special cases, the density of the stable distribution is intractable (Lombardi, 2007). This has led to the development of ABC techniques for stable distributions (Peters et al., 2012).
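Although the stable density is intractable, exact simulation is straightforward, which is precisely what likelihood-free methods exploit. Below is a sketch of the Chambers-Mallows-Stuck sampler for the standard case ($\gamma = 1$, $\delta = 0$) with $\alpha \ne 1$; matching it to the characteristic function's parametrisation above is an assumption.

```python
import numpy as np

rng = np.random.default_rng(6)

def rvs_stable(alpha, beta, size, rng):
    """Chambers-Mallows-Stuck sampler for a standard stable law
    S(alpha, beta), alpha != 1: the density has no closed form, but
    draws are obtained from one uniform and one exponential variate."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    t = beta * np.tan(np.pi * alpha / 2)
    b = np.arctan(t) / alpha
    s = (1.0 + t ** 2) ** (1.0 / (2 * alpha))
    return (s * np.sin(alpha * (u + b)) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * (u + b)) / w) ** ((1.0 - alpha) / alpha))

# alpha = 2 recovers a Gaussian with variance 2 in this parametrisation
z = rvs_stable(2.0, 0.0, 100_000, rng)
```

For $\alpha < 2$ the draws are heavy-tailed with infinite variance, so summary statistics for ABC are typically chosen to be robust (e.g. quantile-based) rather than moment-based.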
All fixed parameters are known except for one, on which we place a prior. Since this parameter must stay between its lower bound and 1, we define the transition kernel (Algorithm 3) in the following manner: