# A probabilistic scheme for joint parameter estimation and state prediction in complex dynamical systems

Many problems in the geophysical sciences demand the ability to calibrate the parameters and predict the time evolution of complex dynamical models using sequentially-collected data. Here we introduce a general methodology for the joint estimation of the static parameters and the forecasting of the state variables of nonlinear, and possibly chaotic, dynamical models. The proposed scheme is essentially probabilistic. It aims at recursively computing the sequence of joint posterior probability distributions of the unknown model parameters and its (time varying) state variables conditional on the available observations. The latter are possibly partial and contaminated by noise. The new framework combines a Monte Carlo scheme to approximate the posterior distribution of the fixed parameters with filtering (or data assimilation) techniques to track and predict the distribution of the state variables. For this reason, we refer to the proposed methodology as nested filtering. In this paper we specifically explore the use of Gaussian filtering methods, but other approaches fit naturally within the new framework. As an illustrative example, we apply three different implementations of the methodology to the tracking of the state, and the estimation of the fixed parameters, of a stochastic two-scale Lorenz 96 system. This model is commonly used to assess data assimilation procedures in meteorology. For this example, we compare different nested filters and show estimation and forecasting results for a 4,000-dimensional system.

## Authors

• 2 publications
• 1 publication
• 10 publications
• ### Nested Gaussian filters for recursive Bayesian inference and nonlinear tracking in state space models

We introduce a new sequential methodology to calibrate the fixed paramet...
03/23/2021 ∙ by Sara Pérez-Vieites, et al. ∙ 0

• ### Nested Sequential Monte Carlo Methods

We propose nested sequential Monte Carlo (NSMC), a methodology to sample...
02/09/2015 ∙ by Christian A. Naesseth, et al. ∙ 0

• ### Robust parameter estimation in dynamical systems via Statistical Learning with an application to epidemiological models

We propose a robust parameter estimation method for dynamical systems ba...
07/28/2020 ∙ by Diego Marcondes, et al. ∙ 0

• ### Kernel-based parameter estimation of dynamical systems with unknown observation functions

A low-dimensional dynamical system is observed in an experiment as a hig...
09/09/2020 ∙ by Ofir Lindenbaum, et al. ∙ 7

• ### Deep Nonlinear Non-Gaussian Filtering for Dynamical Systems

Filtering is a general name for inferring the states of a dynamical syst...
11/14/2018 ∙ by Arash Mehrjou, et al. ∙ 8

• ### Information sensitivity functions to assess parameter information gain and identifiability of dynamical systems

A new class of functions, called the `Information sensitivity functions'...
11/21/2017 ∙ by Sanjay Pant, et al. ∙ 0

• ### Joint state-parameter estimation of a nonlinear stochastic energy balance model from sparse noisy data

While nonlinear stochastic partial differential equations arise naturall...
04/10/2019 ∙ by Fei Lu, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

A common feature to many problems in some of the most active fields of science is the need to calibrate (i.e., estimate the parameters) and then forecast the time evolution of high-dimensional dynamical systems using sequentially-collected observations. One can find obvious examples in meteorology, where current models for global weather forecasting involve the tracking of millions of time-varying state variables Clayton13 , as well as in oceanography VanLeeuwen03 or in climate modelling Dee11 . This problem is not constrained to geophysics, though. In biochemistry and ecology it is often necessary to forecast the evolution of populations of interacting species (typically animal and/or vegetal species in ecology and different types of reacting molecules in biochemistry), which usually involves the estimation of the parameters that govern the interaction as well Golightly11 .

### i.1 State of the art

Traditionally, model calibration (i.e., the estimation of the model static parameters) and the tracking and forecasting of the time-varying state variables have been addressed separately. The problem of tracking the state of the system using sequentially-collected observations is often termed data assimilation in geophysics, while it is referred to as stochastic or Bayesian filtering by researchers in computational statistics and applied probability. Carrying out both tasks jointly, parameter estimation and state forecasting, is a hard problem posing several practical and theoretical difficulties.

Many procedures have been suggested over the years (see, e.g., Liu01b ; Aksoy06 ; Evensen09 ; Carvalho10 , as well as Kantas15

for a survey), however they are subject to problems related to observability (i.e., ambiguities), lack of performance guarantees or prohibitive computational demands. Some of the most relevant techniques can be classified in one or more of the categories below.

• State augmentation methods with artificial dynamics: the state vector, which contains the dynamical variables that describe the physical system, is extended with any static unknown parameters (commonly reinterpreted as “slowly changing” dynamical variables) in the model

Andrieu04 ; Kitagawa98 ; Liu01b ; zhang2016joint . Standard filtering (or data assimilation) techniques are then used in order to track and forecast the extended state vector.

• Particle learning techniques: for some models, the posterior probability distribution of the static parameters, conditional on the system state, can be computed in closed form and it depends only on a set of finite-dimensional statistics Storvik02 ; Djuric02 ; Carvalho10 . In a Monte Carlo setting, e.g., for particle filters, this means that the static parameters can be efficiently represented by sampling. Unfortunately, this approach is restricted to very specific models (an attempt to extend this idea to a more general setting can be found in Djuric04 ). The term particle learning was coined in Carvalho10 , although the fundamental ideas were introduced earlier Storvik02 ; Djuric02 .

• Classical importance resampling methods: several authors have studied the performance of classical sequential importance sampling for static parameters Papavasiliou05 ; Papavasiliou06 ; Olsson08 . Unfortunately, such algorithms tend to degenerate quickly over time unless certain conditions are met by the prior and posterior distributions Papavasiliou05 ; Papavasiliou06

or computationally-heavy interpolation schemes are adopted for the static parameters

Olsson08 .

Only in the last few years there have been advances leading to well-principled probabilistic methods that solve the joint problem numerically and supported by rigorous performance analyses Andrieu10 ; Chopin12 ; Crisan18bernoulli . They aim at calculating the posterior probability distribution of all the unknown variables and parameters of the model. From the viewpoint of Bayesian analysis, these conditional, or posterior

, distributions contain all the information relevant for the estimation task. From them, one can compute point estimates of the parameters and states but also quantify the estimation error. However, state-of-the-art methods for Bayesian parameter estimation and stochastic filtering are batch techniques, i.e., they process the whole set of available observations repeatedly in order to produce numerical solutions. For this reason, they are not well suited to problems where observations are collected sequentially and have to be processed as they arrive (or, simply, when the sequence of observations is too long). The popular particle Markov chain Monte Carlo (pMCMC)

Andrieu10 and the sequential Monte Carlo square (SMC) Chopin12 schemes are examples of such batch methods. The nested particle filter (NPF) of Crisan18bernoulli is a purely recursive Monte Carlo method, more suitable than pMCMC and SMC when long sequences of observations have to be processed. However, this technique is still computationally prohibitive in high dimensional settings as it relies on two layers of intertwined Monte Carlo approximations.

While the schemes in Andrieu10 ; Chopin12 ; Crisan18bernoulli fully rely on Monte Carlo approximations in order to approximate the posterior probability distribution of the parameters and the states, there is an alternative class of schemes, often coined recursive maximum likelihood (RML) methods Andrieu04 ; Andrieu12 ; Kantas15 ; Tadic10 , that enable the sequential processing of the observed data as they are collected but do not yield full posterior distributions of the unknowns. They only output point estimates instead. Therefore, it is not possible to quantify the uncertainty of the estimates or forecasts. Moreover, they are subject to various convergence (and complexity) issues, e.g., when the posterior probability distribution is multimodal, when it contains singularities or when the parameter likelihoods cannot be computed exactly.

In the physics literature, approximation schemes have been proposed that exploit the conditional dependences between the static parameters and the dynamic state variables, in a way that resembles the SMC or NPF schemes. The authors of santitissadeekorn2015two

introduce a two-stage filter that alternates the estimation of static parameters (conditional on a fixed state estimate) and the tracking of the dynamic variables (conditional on a fixed estimate of the static parameters). Another alternating scheme, that combines Monte Carlos estimators with ensemble Kalman filters in order to handle the static parameters and dynamic states, can be found in

frei2012sequential .

In Ashley15

, an expectation-maximization (EM) algorithm is used to track a particle whose dynamics are governed by a hidden Markov model. The expectation step involves a (Monte Carlo based) forward-filtering, backward-smoothing step that is computationally heavy and prevents the online application of the method. The authors of

Ye15 investigate a variational scheme (based in the Laplace integral approximation) for data assimilation (including state and parameter estimation) and illustrate it with applications to the Lorenz 63 and Lorenz 96 models in a low dimensional setting. The same task of data assimilation with parameter estimation is tackled in Ito16 . In this case, the estimation of the states and parameter is reduced to an optimization problem that can be solved via an adjoint method for the estimation of a Hessian matrix. The schemes in Ashley15 , Ye15 and Ito16 require to process the data in batches, rather than recursively, and hence they are not well suited for online implementations. A sequential method, based on variational Bayes techniques, that admits an online (recursive) implementation can be found in Vrettas15 . However, the latter contribution is limited to the estimation of the time-varying states and does not deal with unkown static parameters.

### i.2 Contribution

In this paper we propose a general probabilistic scheme to perform the joint task of parameter estimation and state tracking and forecasting. The methodology is Bayesian, i.e., it aims at the computation of the posterior probability distribution of the unknowns given the available data. It involves two layers of estimators, one for the static parameters and another one for the time-varying state variables. It can be interpreted that the state estimators and predictors are nested, or inserted, within a main algorithm that tackles the estimation of the parameters. The estimation of the static parameters and the dynamic variables is carried out in a purely sequential and recursive manner. This property makes the proposed method well-suited for problems where long time series of data have to be handled.

It can be shown that a particular case of the proposed scheme is the NPF of Crisan18bernoulli , which relies on a sequential Monte Carlo sampler in the parameter space and bank of particle filters Gordon93 ; Doucet00 in the space of the dynamic variables. However, the key feature and advantage of the general scheme that we advocate here is the ability to combine different types of algorithms in the two layers of inference (parameters and dynamic variables). Any grid-based method (where the probability distribution of the static parameters is represented by a set of points in the parameter space) can be employed in the first layer, while the computationally-heavy particle filters in the second layer of the NPF can be replaced by simpler algorithms, easier to apply in practical problems.

We have investigated the use of sequential Monte Carlo and quasi-Monte Carlo Gerber15 techniques in the parameter estimation layer. We note that the quasi-Monte Carlo scheme is a deterministic technique, although it formally resembles the Monte Carlo approach (hence the name). In the same vein, an unscented Kalman filter (UKF) can be utilized in the parameter estimation layer, although we have left this for future research. For the second layer, we have assessed two Gaussian filters, namely the extended Kalman filter (EKF) and the ensemble Kalman filter (EnKF). These two types of Gaussian filters have been well-studied in the geophysics literature and there are a number of numerical techniques to ease their practical implementation for large-scale systems (e.g., covariance inflation Anderson09 ; Lihong09 or localization Ott04 ; Houtekamer01 ; vossepoel2007parameter ).

Because the flexibility to combine estimation techniques of different types within the same overall scheme is a key advantage of the proposed methodology, we refer to the resulting algorithms in general as nested hybrid filters

(NHFs). Besides the numerical example described below, we provide a theoretical result on the asymptotic convergence of NHFs that use a sequential Monte Carlo scheme in the first layer (for the static parameters) and finite-variance estimators of the state variables in the second layer. Our analysis shows that the NHF can be biased if the filters in the second layer are (as it is the case in general with approximate Gaussian filters). However, it also ensures that the approximate posterior distribution of the parameters generated by the NHF, consisting of

samples in the parameter space, converges to a well-defined limit distribution with rate under mild assumptions.

To illustrate the performance of the methodology, we present the results of computer simulations with a stochastic two-scale Lorenz 96 model Arnold13 with underlying chaotic dynamics. In meteorology, the two-scale Lorenz 96 model is commonly used as a benchmark system for data assimilation VanLeeuwen10 and parameter estimation techniques Hakkarainen11 because it displays the basic physical features of atmospheric dynamics Arnold13 (e.g., convection and sensitivity to perturbations). We have implemented, and compared numerically, four NHFs that combine Monte Carlo, quasi-Monte Carlo, EKF and EnKF schemes in different ways. All the combinations that we have tried yield significant reductions of running times in comparison with the NPF for this model, without a significant loss of accuracy. We report simulation results for systems with up to 5,000 dynamical variables to track and forecast.

### i.3 Organization of the paper

The rest of the paper is organized as follows. After a brief comment on notation, we describe in Section II the class of (stochastic) dynamical systems of interest. NHFs are introduced and explained in Section III. The asymptotic convergence theorem is stated and discussed in Section IV. In Section V, the stochastic Lorenz 96 model which is used in the simulations is described and then, some illustrative numerical results are presented in Section VI. Finally, Section VII is devoted to the conclusions.

### i.4 Notation

We denote vectors and matrices by bold-face letters, either lower-case (for vectors) or upper-case (for matrices). Scalar magnitudes are denoted using regular-face letters. For example, and are scalars, is a vector and is a matrix.

Most of the magnitudes of interest in this paper are random vectors (r.v.’s). If is a -dimensional r.v. taking values in , we use the generic notation

for its probability density function (pdf). This is an argument-wise notation. If we have two r.v.’s,

and , we write and for their respective pdf’s, which are possibly different. In a similar vein, denotes the joint pdf of the two r.v.’s and denotes the conditional pdf of given . We find this simple notation convenient for the presentation of the model and methods and introduce a more specific terminology only for the analysis of convergence. We assume, for simplicity, that all random magnitudes can be described by pdf’s with respect to the Lebesgue measure. Notation is read as “the r.v. is distributed according to the pdf ”.

## Ii Dynamical model and problem statement

### ii.1 State space models

We are interested in systems that can be described by a multidimensional stochastic differential equation (SDE) of the form

 dx=f(x,θ)dt+σdw (1)

where denotes continuous time, is the -dimensional system state, is a nonlinear function parametrized by a fixed vector of unknown parameters, , is a scale parameter that controls the intensity of the stochastic perturbation and is a

vector of independent standard Wiener processes. Very often, the underlying ordinary differential equation (ODE)

describes some peculiar dynamics inherent to the system of interest (e.g., many of the systems of interest in geophysics are chaotic) and the addition of the perturbation accounts for model errors or other sources of uncertainty.

Equation (1) does not have a closed-form solution for a general nonlinear function and, therefore, it has to be discretized for its numerical integration. A discretization scheme with fixed step-size yields, in general, a discrete-time stochastic dynamical system of the form

 ¯xk=¯xk−1+F(¯xk−1,θ,h,σwk), (2)

where denotes discrete time, is the system state at time and is a r.v. of dimension that results from the integration of independent Wiener processes. Since the integral of a Wiener process over an interval of length

is a Gaussian random variable with zero mean and variance

, the r.v. is also Gaussian, with mean and covariance matrix , where is the identity matrix. This was denoted as . The function depends on the choice of discretization scheme. The simplest one is the Euler-Maruyama method, which yields Gard88

 ¯xk=¯xk−1+hf(¯xk−1,θ)+σwk, (3)

i.e., the noise is additive, with , and . For a Runge-Kutta method of order , as a more sophisticated example, and the function results from applying times, with a Gaussian perturbation passing through the nonlinearity at each of these intermediate steps. See Gard88 for details on various integration methods for SDEs. In the sequel, we work with the general Eq. (2).

We assume that the system of Eq. (2) can be observed every discrete-time steps (i.e., every continuous-time units). The -th observation is a r.v. of dimension that we model as

 yn=g(xn,θ)+vn, (4)

, where is the system state at the time of the -th observation (continuous time ), is a transformation that maps the state into the observation space and is a -mean observational-noise vector with covariance matrix .

We can re-write the state Eq. (2) in the time scale of the observations (i.e., discrete-time rather than ) as

 xn=xn−1+Fm(xn−1,θ,h,σwn), (5)

, where the notation indicates that Eq. (2) is applied consecutive times in order to move from to . Note that, as a consequence, the noise vector in Eq. (5) has dimension .

### ii.2 Probabilistic representation and problem statement

The state equation (5) together with the observation equation (4) describe a (stochastic) state space model. The goal is to design methods for the recursive estimation of both the static parameters and the states , . Note that the latter implies the estimation of the sequence , , i.e., the states between observations instants have to be estimated as well.

We adopt a Bayesian approach to this task. From this point of view, both the parameters and the sequence of states are random and we aim at computing their respective probability distributions conditional on the available data, i.e., the sequence of observations . The problem is best described if we replace the functional representation of the state space model in Eqs. (5) and (4) by an equivalent one in terms of pdf’s 111The pdf’s in model (6)–(9) always exist if the functions and are differentiable. There are problems for which the conditional probability distribution of conditional on , which we may denote as , may not have a density with respect to the Lebesgue measure and model (6)–(9) would not be well defined the way it is written. Even in that case, however, the proposed methodology could be applied using approximation filters in the second layer which do not depend on the differentiability of , such as particle filters or EnKF’s.. To be specific, the probabilistic representation consists of the following elements

 x0 ∼ p(x0) (6) θ ∼ p(θ) (7) xn ∼ p(xn|xn−1,θ) (8) yn ∼ p(yn|xn,θ) (9)

where and are, respectively, the a priori pdf’s of the system state and the parameter vector at time ( as well), is the conditional pdf of given the state and the parameters in , and is the conditional pdf of the observation given the state and the parameters.

We note that:

• The priors and can be understood as a probabilistic characterization of uncertainty regarding the system initial condition. If the initial condition were known, could be replaced by a Dirac delta allocating probability 1 at that point.

• The pdf does not have, in general, a closed-form expression because of the nonlinearity . However, it is usually straightforward to simulate given and using Eq. (5) and this is sufficient for many methods to work.

• The observations are conditionally independent given the states and the parameters. If the observational noise is Gaussian, then .

From a Bayesian perspective, all the information relevant for the characterization of and at discrete time (corresponding to ) is contained in the joint posterior pdf , where . The latter density cannot be computed exactly in general and the goal of this paper is to describe a class of flexible and efficient recursive methods for its approximation.

We will show that one way to attain this goal is to tackle the approximation of the sequence of posterior pdf’s of the parameters, , . This yields, in a natural way, approximations for and for each , as well as predictions for the densities of the intermediate states, , for .

## Iii Nested Hybrid Filtering

### iii.1 Importance sampling for parameter estimation

In order to introduce the proposed scheme of nested hybrid filters, let us consider the approximation of the -th posterior probability distribution of the parameters, with pdf , using classical importance sampling Robert04 . In particular, let be an arbitrary proposal pdf for the parameter vector and assume that whenever .

Assume that the posterior at time , , is available. Then the posterior pdf at time

can be expressed, via Bayes’ theorem, as

 p(θ|y1:n)∝p(yn|θ,y1:n−1)p(θ|y1:n−1), (10)

where the proportionality constant, , is independent of . Expression (10) enables the application of the importance sampling method to approximate integrals w.r.t. the posterior pdf (i.e., to approximate the statistics of this probability distribution). Specifically, if we

• draw independent and identically distributed (i.i.d.) samples from , denoted , ,

• compute importance weights of the form

 ~win=p(yn|θin,y1:n−1)p(θin|y1:n−1)qn(θin),

and normalize them to obtain

 win=~win∑Nk=1~wkn,i=1,…,N,

then it can be proved Robert04 that

 limN→∞N∑i=1winf(θin)=∫f(θ)p(θ|y1:n)dθalmost surely (a.s.) (11)

for any integrable function under mild regularity assumptions. In this way one could estimate the value of , e.g.,

 θNn=N∑i=1winθin≃∫θp(θ|y1:n)dθ=:E[θ|y1:n],

where denotes the expected value of conditional on the observations . We could also estimate the mean square error (MSE) of this estimator, as

 (12)

The choice of is, of course, key to the complexity and the performance of importance sampling schemes. One particularly simple choice is , which reduces the importance sampling algorithm to

1. drawing i.i.d. samples , , from , and

2. computing normalized importance weights

Unfortunately, this method is not practical because

• it is not possible to draw exactly from , since this pdf is unknown, and

• the likelihood function cannot be evaluated exactly either.

In the sequel we tackle the two issues above and, in doing so, obtain a general scheme for the approximation of the posterior distribution of the parameter vector and the state vector , i.e., the distribution with pdf .

### iii.2 Sequential Monte Carlo hybrid filter

It is well known that the likelihood can be approximated using filtering algorithms Andrieu10 ; Koblents15 . To be specific, function can be written as the integral

 un(θ)=∫p(yn|xn,θ)p(xn|θ,y1:n−1)dθ (13)

where, in turn, the predictive density is

 p(xn|θ,y1:n−1)=∫p(xn|θ,xn−1)p(xn−1|θ,y1:n−1)dxn−1 (14)

and

 p(xn−1|θ,y1:n−1)∝p(yn−1|θ,xn−1)p(xn−1|θ,y1:n−2). (15)

Given a fixed parameter vector and a prior pdf , the sequence of likelihoods can be computed by recursively applying Eqs. (13), (14) and (15) for .

Let us now assume that we are given a sequence of parameter vectors and we are interested in computing the likelihood of the last vector, . Following Crisan18bernoulli , one can compute a sequence of approximate likelihoods , , using the recursion

 ^p(xn−1|θn−1,y1:n−1) ∝ p(yn−1|θn−1,xn−1)^p(xn−1|θn−1,y1:n−2) (16) ^p(xn|θn,y1:n−1) := ∫p(xn|θn,xn−1)^p(xn−1|θn−1,y1:n−1)dxn−1 (17) ^un(θn) := ∫p(yn|θn,xn)^p(xn|θn,y1:n−1)dxn (18)

which starts with the initial density . It can be proved, using the same type of continuity arguments in Crisan18bernoulli , that the approximation error

 |uk(θ′)−^uk(θ′)|, (19)

can be kept bounded, for any , provided some simple assumptions on the state space model and the sequence are satisfied. Note that, in expression (19), is the actual likelihood calculated by iterating (13), (14) and (15) for , while is the approximation computed using the sequence and recursion (16)–(18).

The recursive approximation scheme for can be combined with the “naive” IS procedure of Section III.1 to yield a general (and practical) method for the approximation of the sequence of a posteriori probability distributions of the parameter vector , hereafter denoted as

 μn(dθ):=p(θ|y1:n)dθ.

We refer to the proposed scheme as a nested hybrid filter (NHF) and provide a detailed outline in Algorithm 1.

###### Algorithm 1

Nested hybrid filter (NHF).

Inputs:

• Number of Monte Carlo samples, .

• A priori pdf’s and .

• A Markov kernel which, given , generates jittered parameters .

Procedure:

1. Initialization

Draw , i.i.d. samples from .

2. Recursive step

1. For :

1. Draw from .

2. Approximate using a filtering algorithm.

3. Use this approximation to compute the estimate

 ^un(¯θin)=∫p(yn|¯θin,xn)^p(xn|¯θin,y1:n−1)dxn (20)

and let be the normalized weight of .

2. Resample the discrete distribution

 ¯μNn(dθ)=N∑i=1winδ¯θin(dθ) (21)

times with replacement in order to obtain the particle set and the approximate probability measure .

Outputs: A set of particles and a probability measure .

Algorithm 1 is essentially a sequential Monte Carlo (SMC) method, often known as a particle filter Gordon93 ; Liu98 ; DelMoral04 . At each time step , the output of the algorithm is an estimate of the posterior probability distribution . Specifically we construct the discrete and random probability measure

 μNn(dθ)=1N∑δθin(dθ) (22)

that can be used to approximate any integrals w.r.t. the true probability measure . For example, one can estimate any posterior expectations of the parameter vector given the observations , namely

 E[θ|y1:n]=∫θp(θ|y1:n)dθ≃∫θμNn(dθ)=1N∑iθin=:θNn (23)

Since we have constructed a complete distribution, statistical errors can be estimated as well. The a posteriori covariance matrix of vector can be approximated as

 E[(θ−E[θ|y1:n])(θ−E[θ|y1:n])⊤|y1:n]≃1NN∑i=1(θin−θNn)(θin−θNn)⊤=:PNn. (24)

As a byproduct, Algorithm 1 also yields an approximate predictive pdf for , namely

 ^p(xn|y1:n−1)=N∑i=1win^p(xn|θin,y1:n−1).

If one computes the approximate filter, as well, then the joint probability distribution of and conditioned on (denoted ) can be approximated as

 πNn(dθ×dxn)=N∑i=1win^p(xn|θ,y1:n)δ¯θin(dθ)dxn.

The scheme of Algorithm 1 is referred to as nested because the SMC algorithm generates, at each time step , a set of samples and, for each sample , we embed a filter in the state space in order to compute the pdf and the approximate likelihood . The term hybrid is used because the embedded filters need not be Monte Carlo methods –a variety of techniques can be used and in this paper we focus on Gaussian filters, which are attractive because of their (relative) computational simplicity. A scheme with nested particle filters was thoroughly studied in Crisan18bernoulli ; Crisan17 .

Let us finally remark that the NHF scheme relies on two approximations:

• Jittering of the parameters: The difficulty of drawing samples from can be circumvented if we content ourselves with an approximate sampling step. In particular, if we have computed a Monte Carlo approximation at time (with some of the samples replicated because of the resampling step) then we can generate new particles , , where is a Markov kernel, i.e., a probability distribution for conditional on . See Section IV for guidelines on the selection of this kernel. Intuitively, we can either jitter a few particles with arbitrary variance (while leaving most of them unperturbed) or jitter all particles with a controlled variance that decreases as increases.

• Estimation of likelihoods: The sequential approximation of Eqs. (16)–(18) yields biased estimates of the likelihoods Crisan18bernoulli . This is discussed in Section IV. In Appendix A we provide details on the computation of the estimates and using an ensemble Kalman Filter (EnKF). Other techniques (e.g., particle filters as in Crisan18bernoulli or sigma-point Kalman filters Ambadan09 ; Arasaratnam09 ) can be used as well.

### iii.3 Sequential quasi Monte Carlo hybrid filter

The SMC method in the first layer of Algorithm 1 can be replaced by other schemes that rely on the point-mass representation of the posterior probability distribution . It is possible to devise procedures based, for instance, on an unscented Kalman filter Julier04 or other sigma-point Kalman methods Ambadan09 ; Arasaratnam09 to obtain a Gaussian approximation of . Such Gaussian approximations, however, can be misleading when the posterior distribution is multimodal.

In this subsection, we describe a NHF method (hence, of the same class as Algorithm 1) where the SMC scheme is replaced by a sequential quasi-Monte Carlo (SQMC) procedure of the type introduced in Gerber15 . The term quasi-Monte Carlo (QMC) refers to a class of deterministic methods for numerical integration Niederreiter92 that employ low-discrepancy point sets (e.g., Halton sequences Halton64 or Sobol sequences Bratley88 ), instead of random sample sets, for the approximation of multidimensional integrals. In the context of QMC, discrepancy is defined to quantify how uniformly the points in a sequence are distributed into an arbitrary set . Hence, the lowest discrepancy is attained when these points are equi-distributed. The main advantage of (deterministic) QMC methods over (random) Monte Carlo schemes is that they can attain a faster rate of convergence relative to the number of points in the grid, . The main disadvantage is that the generation of low-discrepancy points in high-dimensional spaces can be computationally very costly compared to the generation of random Monte Carlo samples Martino18 . Within a NHF, the use of QMC should lead to a better performance/complexity trade-off as long as the parameter dimension, , is relatively small. This is illustrated numerically for a stochastic two-scale Lorenz 96 model in Section VI.

The NHF based on the SQMC methodology of Gerber15 can be obtained from Algorithm 1 if we replace the sampling and resampling steps typical of the SMC schemes by the generation of low-discrepancy point sets. Let be a Halton sequence of low-discrepancy (deterministic) uniform samples Halton64 . These uniform samples can be used to generate low-discrepancy variates from other distributions via a number of methods 222One can use a number of techniques used to produce random samples from a given uniform source. See Martino18 for a comprehensive description of the field, both for single and multivariate distributions.. For example, the Box-Muller transformation BoxMuller

can be used to generate pairs of independent, standard, normally distributed pseudo-random numbers. We explicitly indicate the use of low-discrepancy uniform numbers,

, in the generation of samples with general distributions by conditioning on . Hence, drawing the -th sample from the prior parameter pdf, , is now replaced by . In order to propagate the -th sample at time , , into time , we draw from the kernel . If sampling is needed in the second layer of filters (in order to compute the estimates and ) we use additional Halton sequences in a similar way.

In order to keep the low-discrepancy property across the resampling step, we additionally introduce the following functions (see Gerber15 for details).

• A discrepancy-preserving bijective map . Several choices are possible for this function. Following Gerber15 , here we assume

 ψ(¯θin)=⎡⎣1+exp⎛⎝−¯θin−θ−nθ+n−θ−n⎞⎠⎤⎦−1, (25)

where and are the -dimensional vectors whose -th components are, respectively,

 [θ−n]j=mNn,j−2s2Nn,jand[θ+n]j=mNn,j+2s2Nn,j,

whereas and , , are component-wise means and variances.

• The inverse Hilbert curve, , which is a continuous fractal space-filling curve that provides a locality-preserving map between a 1-dimensional and a -dimensional space Moon01 .

The SQMC-based NHF is outlined in Algorithm 2.

###### Algorithm 2

Sequential quasi Monte Carlo nested hybrid filter (SQMC-NHF).

Inputs:

• Number of Monte Carlo samples, .

• A priori pdf’s and .

• A Markov kernel which, given , generates jittered parameters .

Procedure:

1. Initialization

1. Generate QMC uniform samples in . Draw , .

2. Recursive step, .

1. For :

1. If , then draw , else draw , for .

2. Approximate .

3. Use this approximation to compute the estimate

 ^un(¯θin)=∫p(yn|¯θin,xn)^p(xn|¯θin,y1:n−1)dxn. (26)

and let be the normalized weight of .

2. Generate a QMC point set in ; let .

3. Hilbert sort: find a permutation such that

 (h∘ψ)(¯θb(1)n)≤…≤(h∘ψ)(¯θb(N)n),if dθ≥2¯θb(1)n≤…≤¯θb(N)n,if dθ=1.
4. Resampling: find a permutation such that . For , set if, and only if,

 j−1∑k=1wb(k)n

Outputs: A set of particles and a probability measure .

## Iv Convergence analysis

The nested filtering schemes of Section III admit various implementations depending on how we choose to approximate the conditional pdf which, in turn, is needed to estimate the likelihood function and compute the importance weights , .

For each choice of approximation method, the estimate may behave differently and yield different convergence properties. Here we assume that is a random variable with finite mean

and finite moments up to some prescribed order

. Specifically, we make following assumption.

###### A. 1

Given , the estimator is random and can be written as

 ^un(θ)=¯un(θ)+mn(θ), (27)

where is a zero-mean r.v. satisfying for some prescribed . Furthermore, the mean has the form

 ¯un(θ)=un(θ)+bn(θ), (28)

where is a deterministic and absolutely bounded bias function.

In the sequel, we use to denote the support set of the parameter vector . Given a real function , its absolute supremum is indicated as . The set of absolutely bounded real functions on is denoted , i.e., . For our analysis we assume that and, since we have also assumed the bias function to be bounded, it follows that , i.e., . To be precise, we impose the following assumption.

###### A. 2

Given a fixed sequence of observations , the family of functions satisfies the following inequalities for each :

1. , and

2. for any .

Since , A.2.1 follows from assumption A.1. Similarly, if for all then A.2.2 is a natural assumption (since is an estimator of a positive magnitude).

We shall prove that, because of the bias , the approximation converges to the perturbed probability measure induced by the mean function , instead of the true posterior probability measure induced by the true likelihood function .

To be specific, the sequence of posterior measures , , can be constructed recursively, starting from a prior , by means of the projective product operation Bain08 , denoted . When is a positive and bounded function and is a probability measure, the new measure is defined in terms of its integrals. In particular, if then

 ∫a(θ)(u⋅μ)(dθ):=∫a(θ)u(θ)μ(dθ)∫u(θ)μ(dθ).

For conciseness, hereafter we use the shorthand

 (a,μ):=∫a(θ)μ(dθ)

for the integral of a function w.r.t. a measure . With this notation, we can write

 (a,μn)=(a,un⋅μn−1)=(aun,μn−1)(un,μn−1). (29)

If, instead of the true likelihood , we use the biased function to update the posterior probability measure associated to the parameter vector at each time then we obtain the new sequence of measures

 ¯μ0=μ0,¯μn=¯un⋅¯μn−1,n=1,2,...,

where, according to the definition of the projective product,

 (a,¯μn)=(a¯un,¯μn−1)(¯un,¯μn−1)

for any integrable function . Note that the two sequences, and , start from the same prior . Obviously, we recover the original sequence, i.e, , when the bias vanishes, .

In this section we prove that the approximation generated by a generic nested filter that satisfies A.1 and A.2 converges to in , for each , under additional regularity assumption on the jittering kernel .

###### A. 3

The kernel used in the jittering step satisfies the inequality

 supθ′∈D∫|h(θ)−h(θ′)|κN(dθ|θ′)≤cκ∥h∥∞√N (30)

for every and some constant independent of .

A simple kernel that satisfies A.3 is Crisan18bernoulli

 κN(dθ|θ′)=(1−ϵN)δθ′(dθ)+ϵNκ(dθ|θ′),

where and is an arbitrary Markov kernel with mean and finite variance, for example , where and is the identity matrix. Intuitively, this kind of kernel changes each particle with probability and leaves it unmodified with probability .

Finally, we can state a general result on the convergence of Algorithm 1. For a real random variable and , let denote the norm, i.e. .