# Nudging the Particle Filter

We investigate a new sampling scheme to improve the performance of particle filters in scenarios where either (a) there is a significant mismatch between the assumed model dynamics and the actual system producing the available observations, or (b) the system of interest is high-dimensional and the posterior probability tends to concentrate in relatively small regions of the state space. The proposed scheme generates nudged particles, i.e., subsets of particles which are deterministically pushed towards specific areas of the state space where the likelihood is expected to be high, an operation known as nudging in the geophysics literature. This is a device that can be plugged into any particle filtering scheme, as it does not involve modifications in the classical algorithmic steps of sampling, computation of weights, and resampling. Since the particles are modified, but the importance weights do not account for this modification, the use of nudging leads to additional bias in the resulting estimators. However, we prove analytically that particle filters equipped with the proposed device still attain asymptotic convergence (with the same error rates as conventional particle methods) as long as the nudged particles are generated according to simple and easy-to-implement rules. Finally, we show numerical results that illustrate the improvement in performance and robustness that can be attained using the proposed scheme. In particular, we show the results of computer experiments involving a misspecified Lorenz 63 model, object tracking with misspecified models, and a high-dimensional Lorenz 96 chaotic model. For the examples we have investigated, the new particle filter outperforms conventional algorithms empirically, while it incurs only negligible computational overhead.


## 1 Introduction

### 1.1 Background

State-space models (SSMs) are ubiquitous in many fields of science and engineering, including weather forecasting, mathematical finance, target tracking, machine learning, population dynamics, etc., where inferring the states of dynamical systems from data plays a key role.

An SSM comprises a pair of stochastic processes $\{x_t\}_{t \ge 0}$ and $\{y_t\}_{t \ge 1}$, called the signal process and the observation process, respectively. The conditional relations between these processes are defined with a transition model and an observation model (also called a likelihood model), where the observations are conditionally independent given the signal process, and the latter is itself a Markov process. Given an observation sequence $y_{1:t} = \{y_1, \ldots, y_t\}$, the filtering problem in SSMs consists in the estimation of expectations with respect to the posterior probability distribution of the hidden states, conditional on $y_{1:t}$, which is also referred to as the filtering distribution.

Apart from a few special cases, neither the filtering distribution nor the integrals (or expectations) with respect to it can be computed exactly; hence, one needs to resort to numerical approximations of these quantities. Particle filters (PFs) have been a classical choice for this task since their introduction by Gordon et al (1993); see also Kitagawa (1996); Liu and Chen (1998); Doucet et al (2000, 2001). The PF constructs an empirical approximation of the posterior probability distribution via a set of Monte Carlo samples (usually termed particles) which are modified or killed sequentially as more data are taken into account. These samples are then used to estimate the relevant expectations. The original form of the PF, often referred to as the bootstrap particle filter (BPF), has received significant attention due to its efficiency in a variety of problems, its intuitive appeal and its straightforward implementation. A large body of theoretical work concerning the BPF has also been compiled. For example, it has been proved that the expectations with respect to the empirical measures constructed by the BPF converge to the expectations with respect to the true posterior distributions when the number of particles is large enough (Del Moral and Guionnet, 1999; Chopin, 2004; Künsch, 2005; Douc and Moulines, 2008) or that they converge uniformly over time under additional assumptions related to the stability of the true distributions (Del Moral and Guionnet, 2001; Del Moral, 2004).

Despite the success of PFs in relatively low-dimensional settings, their use has been regarded as impractical in models where $x_t$ and $y_t$ are sequences of high-dimensional random variables. In such scenarios, standard PFs have been shown to collapse (Bengtsson et al, 2008; Snyder et al, 2008). This problem has received significant attention from the data assimilation community. The high-dimensional models which are common in meteorology and other fields of Geophysics are often dealt with via an operation called nudging (Hoke and Anthes, 1976; Malanotte-Rizzoli and Holland, 1986, 1988; Zou et al, 1992). Within the particle filtering context, nudging can be defined as a transformation of the particles, which are pushed towards the observations using some observation-dependent map (van Leeuwen, 2009, 2010; Ades and van Leeuwen, 2013, 2015). If the dimensions of the observations and the hidden states are different, which is often the case, a gain matrix is computed in order to perform the nudging operation. In van Leeuwen (2009, 2010); Ades and van Leeuwen (2013, 2015), nudging is performed after the sampling step of the particle filter. The importance weights are then computed accordingly, so that they remain proper. Hence, nudging in this version amounts to a sophisticated choice of the importance function that generates the particles. It has been shown (numerically) that the schemes proposed by van Leeuwen (2009, 2010); Ades and van Leeuwen (2013, 2015) can track high-dimensional systems with a low number of particles. However, generating samples from the nudged proposal requires costly computations for each particle, and the evaluation of weights becomes heavier as well. It is also unclear how to apply existing nudging schemes when non-Gaussianity and nontrivial nonlinearities are present in the observation model.

A related class of algorithms includes the so-called implicit particle filters (IPFs) (Chorin and Tu, 2009; Chorin et al, 2010; Atkins et al, 2013). Similar to nudging schemes, IPFs rely on the principle of pushing particles to high-probability regions in order to prevent the collapse of the filter in high-dimensional state spaces. In a typical IPF, the region where particles should be generated is determined by solving an algebraic equation. This equation is model dependent, yet it can be solved for a variety of different cases (general procedures for finding solutions are given by Chorin and Tu (2009) and Chorin et al (2010)). The fundamental principle underlying IPFs, moving the particles towards high-probability regions, is similar to nudging. Note, however, that unlike IPFs, nudging-based methods are not designed to guarantee that the resulting particles land on high-probability regions; it can be the case that nudged particles are moved to relatively low probability regions (at least occasionally). Since an IPF requires the solution of a model-dependent algebraic equation for every particle, it can be computationally costly, similar to the nudging methods by van Leeuwen (2009, 2010); Ades and van Leeuwen (2013, 2015). Moreover, it is not straightforward to derive the map for the translation of particles in general models, hence the applicability of IPFs depends heavily on the specific model at hand.

### 1.2 Contribution

In this work, we propose a modification of the PF, termed the nudged particle filter (NuPF) and assess its performance in high dimensional settings and with misspecified models. Although we use the same idea for nudging that is presented in the literature, our algorithm has subtle but crucial differences, as summarized below.

• First, we define the nudging step not just as a relaxation step towards observations but as a step that strictly increases the likelihood of a subset of particles. This definition paves the way for different nudging schemes, such as using the gradients of likelihoods or employing random search schemes to move around the state-space. In particular, classical nudging (relaxation) operations arise as a special case of nudging using gradients when the likelihood is assumed to be Gaussian. Compared to IPFs, the nudging operation we propose is easier to implement as we only demand the likelihood to increase (rather than the posterior density). Indeed, nudging operators can be implemented in relatively straightforward forms, without the need to solve model-dependent equations.

• Second, unlike the other nudging based PFs, we do not correct the bias induced by the nudging operation during the weighting step. Instead, we compute the weights in the same way they would be computed in a conventional (non-nudged) PF and the nudging step is devised to preserve the convergence rate of the PF, under mild standard assumptions, despite the bias. Moreover, computing biased weights is usually faster than computing proper (unbiased) weights. Depending on the choice of nudging scheme, the proposed algorithm can have an almost negligible computational overhead compared to the conventional PF from which it is derived.

• Finally, we show that a nudged PF for a given SSM (say $\mathcal{M}$) is equivalent to a standard BPF running on a modified dynamical model (denoted $\tilde{\mathcal{M}}$). In particular, model $\tilde{\mathcal{M}}$ is endowed with the same likelihood function as $\mathcal{M}$, but its transition kernel is observation-driven in order to match the nudging operation. As a consequence, the implicit model $\tilde{\mathcal{M}}$ is “adapted to the data” and we have empirically found that, for any sufficiently long sequence $y_{1:t}$, the evidence (given a data set $y_{1:t}$, the evidence in favour of a model $\mathcal{M}$ is the joint probability density of $y_{1:t}$ conditional on $\mathcal{M}$, denoted $p(y_{1:t}|\mathcal{M})$; see Robert, 2007) in favour of $\tilde{\mathcal{M}}$ is greater than the evidence in favour of $\mathcal{M}$. We can show, for several examples, that this implicit adaptation to the data makes the NuPF robust to mismatches in the state equation of the SSM compared to conventional PFs. In particular, provided that the likelihoods are specified or calibrated reliably, we have found that NuPFs perform reliably under a certain amount of mismatch in the transition kernel of the SSM, while standard PFs degrade clearly in the same scenario.

In order to illustrate the contributions outlined above, we present the results of several computer experiments with both synthetic and real data. In the first example, we assess the performance of the NuPF when applied to a linear-Gaussian SSM. The aim of these computer simulations is to compare the estimation accuracy and the computational cost of the proposed scheme with several other competing algorithms, namely a standard BPF, a PF with optimal proposal function and a NuPF with proper weights. The fact that the underlying SSM is linear-Gaussian enables the computation of the optimal importance function (intractable in a general setting) and proper weights for the NuPF. We implement the latter scheme because of its similarity to standard nudging filters in the literature. This example shows that the NuPF suffers only a slight performance degradation compared to the PF with the optimal importance function or the NuPF with proper weights, while the latter two algorithms are computationally more demanding.

The second and third examples are aimed at testing the robustness of the NuPF when there is a significant misspecification in the state equation of the SSM. This is helpful in real-world applications because practitioners often have more control over measurement systems, which determine the likelihood, than they have over the state dynamics. We present computer simulation results for a stochastic Lorenz 63 model and a maneuvering target tracking problem.

In the fourth example, we present numerical results for a stochastic Lorenz 96 model, in order to show how a relatively high-dimensional system can be tracked without a major increase of the computational effort compared to the standard BPF. For this set of computer simulations we have also compared the NuPF with the Ensemble Kalman filter (EnKF), which is the de facto choice for tackling this type of systems.

Let us remark that, for the two stochastic Lorenz systems, the Markov kernel in the SSM can be sampled in a relatively straightforward way, yet transition probability densities cannot be computed (as they involve a sequence of noise variables mapped by a composition of nonlinear functions). Therefore, computing proper weights for proposal functions other than the Markov kernel itself is, in general, not possible for these examples.

Finally, we demonstrate the practical use of the NuPF on a problem where a real dataset is used to fit a stochastic volatility model using either particle Markov chain Monte Carlo (pMCMC) (Andrieu et al, 2010) or nested particle filters (Crisan and Miguez, 2018).

### 1.3 Organisation

The paper is structured as follows. After a brief note about notation, we describe the SSMs of interest and the BPF in Section 2. Then in Section 3, we outline the general algorithm and the specific nudging schemes we propose to use within the PF. We prove a convergence result in Section 4 which shows that the new algorithm has the same asymptotic convergence rate as the BPF. We also provide an alternative interpretation of the nudging operation that explains its robustness in scenarios where there is a mismatch between the observed data and the assumed SSM. We discuss the computer simulation experiments in Section 5 and present results for real data in Section 6. Finally, we make some concluding remarks in Section 7.

### 1.4 Notation

We denote the set of real numbers as $\mathbb{R}$, while $\mathbb{R}^d$ is the space of $d$-dimensional real vectors. We denote the set of positive integers with $\mathbb{N}$ and the set of positive reals with $\mathbb{R}_+$. We represent the state space with $\mathsf{X} \subseteq \mathbb{R}^{d_x}$ and the observation space with $\mathsf{Y} \subseteq \mathbb{R}^{d_y}$.

In order to denote sequences, we use the shorthand notation $x_{t_1:t_2} = \{x_{t_1}, \ldots, x_{t_2}\}$. For sets of integers, we use $[n] = \{1, \ldots, n\}$. The $p$-norm of a vector $x \in \mathbb{R}^d$ is defined by $\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$. The $L_p$ norm of a random variable $z$ with probability density function (pdf) $p(z)$ is denoted $\|z\|_p = \left(\int |z|^p\, p(z)\, dz\right)^{1/p}$, for $p \ge 1$. The Gaussian (normal) probability distribution with mean $\mu$ and covariance matrix $\Sigma$ is denoted $\mathcal{N}(\mu, \Sigma)$. We denote the identity matrix of dimension $d$ with $I_d$.

The supremum norm of a real function $\varphi : \mathsf{X} \to \mathbb{R}$ is denoted $\|\varphi\|_\infty = \sup_{x \in \mathsf{X}} |\varphi(x)|$. A function is bounded if $\|\varphi\|_\infty < \infty$, and we indicate the space of real bounded functions $\mathsf{X} \to \mathbb{R}$ as $B(\mathsf{X})$. The set of probability measures on $\mathsf{X}$ is denoted $\mathcal{P}(\mathsf{X})$, the Borel $\sigma$-algebra of subsets of $\mathsf{X}$ is denoted $\mathcal{B}(\mathsf{X})$, and the integral of a function $\varphi$ with respect to a measure $\mu$ on the measurable space $(\mathsf{X}, \mathcal{B}(\mathsf{X}))$ is denoted $(\varphi, \mu) := \int \varphi(x)\, \mu(dx)$. The unit Dirac delta measure located at $x' \in \mathsf{X}$ is denoted $\delta_{x'}(dx)$. The Monte Carlo approximation of a measure $\pi$ constructed using $N$ samples is denoted as $\pi^N$. Given a Markov kernel $\tau(dx|x')$ and a measure $\pi(dx')$, we define the notation $\tau\pi(dx) := \int \tau(dx|x')\, \pi(dx')$.

## 2 Background

### 2.1 State space models

We consider SSMs of the form

$$x_0 \sim \pi_0(dx_0), \tag{1}$$
$$x_t | x_{t-1} \sim \tau_t(dx_t | x_{t-1}), \tag{2}$$
$$y_t | x_t \sim g_t(y_t | x_t), \quad t \in \mathbb{N}, \tag{3}$$

where $x_t \in \mathsf{X}$ is the system state at time $t$, $y_t \in \mathsf{Y}$ is the $t$-th observation, the measure $\pi_0$ describes the prior probability distribution of the initial state, $\tau_t(dx_t|x_{t-1})$ is a Markov transition kernel on $\mathsf{X}$, and $g_t(y_t|x_t)$ is the (possibly non-normalised) pdf of the observation $y_t$ conditional on the state $x_t$. We assume the observation sequence $\{y_t\}_{t \ge 1}$ is arbitrary but fixed. Hence, it is convenient to think of the conditional pdf $g_t(y_t|x_t)$ as a likelihood function of $x_t$, and we write $g_t(x_t) := g_t(y_t|x_t)$ for conciseness.

We are interested in the sequence of posterior probability distributions of the states generated by the SSM. To be specific, at each time $t = 1, 2, \ldots$ we aim at computing (or, at least, approximating) the probability measure $\pi_t$, which describes the probability distribution of the state $x_t$ conditional on the observation of the sequence $y_{1:t}$. When it exists, we use $\pi_t(x_t)$ to denote the pdf of $x_t$ given $y_{1:t}$ with respect to the Lebesgue measure, i.e., $\pi_t(dx_t) = \pi_t(x_t)\, dx_t$.

The measure $\pi_t$ is often termed the optimal filter at time $t$. It is closely related to the probability measure $\xi_t$, which describes the probability distribution of the state $x_t$ conditional on $y_{1:t-1}$, and it is, therefore, termed the predictive measure at time $t$. As for the case of the optimal filter, we use $\xi_t(x_t)$ to denote the pdf, with respect to the Lebesgue measure, of $x_t$ given $y_{1:t-1}$.
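To make the class of models concrete, the following sketch simulates a simple one-dimensional linear-Gaussian instance of the SSM (1)-(3). The coefficients `a`, `q`, `r` and the horizon `T` are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D linear-Gaussian SSM (all parameters are arbitrary choices):
#   x_0 ~ N(0, 1),  x_t = a x_{t-1} + N(0, q),  y_t = x_t + N(0, r)
a, q, r, T = 0.9, 0.5, 0.3, 50

x = np.empty(T + 1)
y = np.empty(T + 1)
x[0] = rng.normal(0.0, 1.0)   # draw x_0 from the prior pi_0
y[0] = np.nan                 # no observation at t = 0
for t in range(1, T + 1):
    x[t] = a * x[t - 1] + np.sqrt(q) * rng.normal()  # transition kernel tau_t
    y[t] = x[t] + np.sqrt(r) * rng.normal()          # observation model g_t
```

Given a trajectory like `y[1:]`, the filtering problem is to approximate the distribution of each `x[t]` from the observations up to time `t`.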

### 2.2 Bootstrap particle filter

The BPF (Gordon et al, 1993) is a recursive algorithm that produces successive Monte Carlo approximations of and for . The method can be outlined as shown in Algorithm 1.

After an initialization stage, where a set of independent and identically distributed (i.i.d.) samples from the prior are drawn, it consists of three recursive steps which can be depicted as,

 πNt−1→samplingξNt→weighting~πNt→% resamplingπNt. (4)

Given a Monte Carlo approximation $\pi_{t-1}^N = \frac{1}{N} \sum_{i=1}^N \delta_{x_{t-1}^{(i)}}$ computed at time $t-1$, the sampling step yields an approximation of the predictive measure $\xi_t$ of the form

$$\xi_t^N = \frac{1}{N} \sum_{i=1}^N \delta_{\bar{x}_t^{(i)}}$$

by propagating the particles via the Markov kernel, $\bar{x}_t^{(i)} \sim \tau_t(dx_t|x_{t-1}^{(i)})$. The observation $y_t$ is assimilated via the importance weights $w_t^{(i)} \propto g_t(\bar{x}_t^{(i)})$, to obtain the approximate filter

$$\tilde{\pi}_t^N = \sum_{i=1}^N w_t^{(i)} \delta_{\bar{x}_t^{(i)}},$$

and the resampling step produces a set of un-weighted particles $\{x_t^{(i)}\}_{i=1}^N$ that completes the recursive loop and yields the approximation

$$\pi_t^N = \frac{1}{N} \sum_{i=1}^N \delta_{x_t^{(i)}}.$$

The random measures $\xi_t^N$, $\tilde{\pi}_t^N$ and $\pi_t^N$ are commonly used to estimate a posteriori expectations conditional on the available observations. For example, if $\varphi$ is a function $\mathsf{X} \to \mathbb{R}$, then the expectation of the random variable $\varphi(x_t)$ conditional on $y_{1:t-1}$ is $(\varphi, \xi_t)$. The latter integral can be approximated using $\xi_t^N$, namely,

$$(\varphi, \xi_t) = \int \varphi(x_t)\, \xi_t(dx_t) \approx (\varphi, \xi_t^N) = \int \varphi(x_t)\, \xi_t^N(dx_t) = \frac{1}{N} \sum_{i=1}^N \varphi(\bar{x}_t^{(i)}).$$

Similarly, we can have estimators $(\varphi, \tilde{\pi}_t^N)$ and $(\varphi, \pi_t^N)$. Classical convergence results are usually proved for real bounded functions, e.g., if $\varphi \in B(\mathsf{X})$ then

$$\lim_{N \to \infty} \left|(\varphi, \pi_t) - (\varphi, \pi_t^N)\right| = 0 \quad \textrm{almost surely (a.s.)}$$

under mild assumptions; see Del Moral (2004); Bain and Crisan (2009) and references therein.
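The three steps in (4) can be sketched compactly in code. The following is a minimal, illustrative bootstrap PF for a one-dimensional linear-Gaussian SSM; all model parameters and the test function $\varphi(x) = x$ are assumptions made for this sketch, not choices from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal bootstrap particle filter for an illustrative 1-D linear-Gaussian SSM:
#   x_t = a x_{t-1} + N(0, q),   y_t = x_t + N(0, r)
a, q, r, N, T = 0.9, 0.5, 0.3, 500, 30

# generate synthetic ground truth and observations
x_true = np.zeros(T)
ys = np.zeros(T)
prev = 0.0
for t in range(T):
    prev = a * prev + np.sqrt(q) * rng.normal()
    x_true[t] = prev
    ys[t] = prev + np.sqrt(r) * rng.normal()

particles = rng.normal(0.0, 1.0, N)  # i.i.d. draws from the prior pi_0
estimates = np.zeros(T)
for t in range(T):
    # sampling: propagate every particle through the kernel tau_t
    particles = a * particles + np.sqrt(q) * rng.normal(size=N)
    # weighting: w_t^(i) proportional to g_t(x) (Gaussian likelihood here)
    logw = -0.5 * (ys[t] - particles) ** 2 / r
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # estimate (phi, pi_t^N) for phi(x) = x, i.e. the posterior mean
    estimates[t] = np.dot(w, particles)
    # multinomial resampling gives the unweighted particle set
    particles = particles[rng.choice(N, size=N, p=w)]

rmse = np.sqrt(np.mean((estimates - x_true) ** 2))
```

For this toy model the estimator `estimates[t]` approximates the posterior mean of the state, with Monte Carlo error of order $1/\sqrt{N}$, consistent with the convergence results quoted above.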

The BPF can be generalized by using arbitrary proposal pdf's $q_t(x_t|x_{t-1}, y_t)$, possibly observation-dependent, instead of the Markov kernel $\tau_t$ in order to generate the particles in the sampling step. This can lead to more efficient algorithms, but the weight computation has to account for the new proposal and we obtain (Doucet et al, 2000)

$$w_t^{(i)} \propto \frac{g_t(\bar{x}_t^{(i)})\, \tau_t(\bar{x}_t^{(i)} | x_{t-1}^{(i)})}{q_t(\bar{x}_t^{(i)} | x_{t-1}^{(i)}, y_t)}, \tag{5}$$

which can be more costly to evaluate. This issue is related to the nudged PF to be introduced in Section 3 below, which can be interpreted as a scheme to choose a certain observation-dependent proposal $q_t$. However, the new method does not require that the weights be computed as in (5) in order to ensure convergence of the estimators.

## 3 Nudged Particle Filter

### 3.1 General algorithm

Compared to the standard BPF, the nudged particle filter (NuPF) incorporates one additional step right after the sampling of the particles at time $t$. The schematic depiction of the BPF in (4) now becomes

 πNt−1→samplingξNt→nudging~ξNt→weighting~πNt→% resamplingπNt, (6)

where the new nudging step intuitively consists in pushing a subset of the generated particles towards regions of the state space where the likelihood function takes higher values.

When considered jointly, the sampling and nudging steps in (6) can be seen as sampling from a proposal distribution which is obtained by modifying the kernel $\tau_t$ in a way that depends on the observation $y_t$. Indeed, this is the classical view of nudging in the literature (van Leeuwen, 2009, 2010; Ades and van Leeuwen, 2013, 2015). However, unlike in this classical approach, here the weighting step does not account for the effect of nudging. In the proposed NuPF, the weights are kept the same as in the original filter, $w_t^{(i)} \propto g_t(\tilde{x}_t^{(i)})$. In doing so, we save computations but, at the same time, introduce bias in the Monte Carlo estimators. One of the contributions of this paper is to show that this bias can be controlled using simple design rules for the nudging step, while practical performance can be improved at the same time.

In order to provide an explicit description of the NuPF, let us first state a definition for the nudging step.

###### Definition 1.

A nudging operator $\alpha_t^{y_t} : \mathsf{X} \to \mathsf{X}$ associated with the likelihood function $g_t(x)$ is a map such that

$$\textrm{if } x' = \alpha_t^{y_t}(x) \textrm{ then } g_t(x') \ge g_t(x) \tag{7}$$

for every $x \in \mathsf{X}$.

Intuitively, we define nudging herein as an operation that increases the likelihood. There are several ways in which this can be achieved, and we discuss some examples in Sections 3.2 and 3.3. The NuPF with nudging operator $\alpha_t^{y_t}$ is outlined in Algorithm 2.

It can be seen that the nudging operation is implemented in two stages.

• First, we choose a set of indices $\mathcal{I}_t \subseteq \{1, \ldots, N\}$ that identifies the particles to be nudged. Let $M = |\mathcal{I}_t|$ denote the number of elements in $\mathcal{I}_t$. We prove in Section 4 that keeping $M \le \sqrt{N}$ allows the NuPF to converge with the same error rates as the BPF. In Section 3.2 we discuss two simple methods to build $\mathcal{I}_t$ in practice.

• Second, we choose an operator $\alpha_t^{y_t}$ that guarantees an increase of the likelihood of any nudged particle. We discuss different implementations of $\alpha_t^{y_t}$ in Section 3.3.

We devote the rest of this section to a discussion of how these two steps can be implemented (in several ways).
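As an illustration of how the two stages slot into the recursion (6), here is a sketch of a single NuPF step for a one-dimensional Gaussian likelihood. The model parameters, the observation value and the gradient-based nudging rule are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

# One NuPF recursion (sampling -> nudging -> weighting -> resampling) for a
# 1-D Gaussian likelihood; every numeric value here is an illustrative choice.
a, q, r, N = 0.9, 0.5, 0.3, 500
M = int(np.sqrt(N))   # number of nudged particles, kept at M <= sqrt(N)
gamma = 0.1           # nudging step size
y_t = 1.5             # current (made-up) observation

particles = rng.normal(0.0, 1.0, N)

# sampling step: propagate through the Markov kernel
particles = a * particles + np.sqrt(q) * rng.normal(size=N)

# nudging step: push M randomly chosen particles up the likelihood surface;
# for this Gaussian likelihood, grad log g_t(x) = (y_t - x) / r
idx = rng.choice(N, size=M, replace=False)
before = np.exp(-0.5 * (y_t - particles[idx]) ** 2 / r)
particles[idx] += gamma * (y_t - particles[idx]) / r
after = np.exp(-0.5 * (y_t - particles[idx]) ** 2 / r)
assert np.all(after >= before)  # Definition 1: the likelihood never decreases

# weighting step: weights computed as in the BPF, NOT corrected for the nudge
logw = -0.5 * (y_t - particles) ** 2 / r
w = np.exp(logw - logw.max())
w /= w.sum()

# resampling step
particles = particles[rng.choice(N, size=N, p=w)]
```

Note that the only change relative to the BPF sketch is the four-line nudging block; the weighting and resampling code is untouched.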

### 3.2 Selection of particles to be nudged

The set of indices $\mathcal{I}_t$ that identifies the particles to be nudged in Algorithm 2 can be constructed in several different ways, either random or deterministic. In this paper, we describe two simple random procedures with little computational overhead.

• Batch nudging: Let the number of nudged particles, $M$, be fixed. A simple way to construct $\mathcal{I}_t$ is to draw $M$ indices uniformly from $\{1, \ldots, N\}$ without replacement and let these indices form $\mathcal{I}_t$. We refer to this scheme as batch nudging, referring to the selection of the $M$ indices at once. One advantage of this scheme is that the number of particles to be nudged, $M = |\mathcal{I}_t|$, is deterministic and can be set a priori.

• Independent nudging: The size and the elements of $\mathcal{I}_t$ can also be selected randomly in a number of ways. Here, we have studied a procedure in which, for each index $i \in \{1, \ldots, N\}$, we assign $i \in \mathcal{I}_t$ with probability $\frac{M}{N}$. In this way, the actual cardinality $|\mathcal{I}_t|$ is random, but its expected value is exactly $M$. This procedure is particularly suitable for parallel implementations, since each index can be assigned to $\mathcal{I}_t$ (or not) at the same time as all others.
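Both selection rules are one-liners in practice. The sketch below contrasts them; the sizes `N` and `M` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 1000, 50   # illustrative particle count and nudging budget

# Batch nudging: exactly M distinct indices, drawn uniformly without replacement
batch_idx = rng.choice(N, size=M, replace=False)

# Independent nudging: each index joins I_t independently with probability M/N,
# so the cardinality |I_t| is random with expected value exactly M
mask = rng.random(N) < M / N
indep_idx = np.flatnonzero(mask)
```

The independent variant touches each index in isolation, which is what makes it convenient to parallelise.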

### 3.3 How to nudge

The nudging step is aimed at increasing the likelihood of a subset of individual particles, namely those with indices contained in $\mathcal{I}_t$. Therefore, any map $\alpha_t^{y_t} : \mathsf{X} \to \mathsf{X}$ such that $g_t(\tilde{x}_t^{(i)}) \ge g_t(\bar{x}_t^{(i)})$ when $\tilde{x}_t^{(i)} = \alpha_t^{y_t}(\bar{x}_t^{(i)})$ is a valid nudging operator. Typical procedures used for optimisation, such as gradient moves or random search schemes, can be easily adapted to implement (relatively) inexpensive nudging steps. Here we briefly describe a few such techniques.

• Gradient nudging: If $g_t$ is a differentiable function of $x_t$, one straightforward way to nudge particles is to take gradient steps. In Algorithm 3 we show a simple procedure with one gradient step alone, where $\gamma_t > 0$ is a step-size parameter and $\nabla_{x_t} g_t$ denotes the vector of partial derivatives of $g_t$ with respect to the state variables, i.e.,

$$\nabla_{x_t} g_t = \begin{bmatrix} \dfrac{\partial g_t}{\partial x_{1,t}} \\ \dfrac{\partial g_t}{\partial x_{2,t}} \\ \vdots \\ \dfrac{\partial g_t}{\partial x_{d_x,t}} \end{bmatrix} \quad \textrm{for} \quad x_t = \begin{bmatrix} x_{1,t} \\ x_{2,t} \\ \vdots \\ x_{d_x,t} \end{bmatrix} \in \mathsf{X}.$$

Algorithms can obviously be designed where nudging involves several gradient steps. In this work we limit our study to the single-step case, which is shown to be effective and keeps the computational overhead to a minimum. We also note that the performance of gradient nudging can be sensitive to the choice of the step-size parameters $\gamma_t$, which are, in turn, model dependent (we have found, nevertheless, that fixed step-sizes, i.e., $\gamma_t = \gamma$ for all $t$, work well in practice for the examples of Sections 5 and 6).

• Random nudging: Gradient-free techniques inherited from the field of global optimisation can also be employed in order to push particles towards regions where they have higher likelihoods. A simple stochastic-search technique adapted to the nudging framework is shown in Algorithm 4. We hereafter refer to the latter scheme as random-search nudging.

• Model specific nudging: Particles can also be nudged using specific model information. For instance, in some applications the state vector can be split into two subvectors, $x_t^{o}$ and $x_t^{u}$ (observed and unobserved, respectively), such that $g_t(x_t) = g_t(x_t^{o})$, i.e., the likelihood depends only on $x_t^{o}$ and not on $x_t^{u}$. If the relationship between $x_t^{o}$ and $x_t^{u}$ is tractable, one can first nudge $x_t^{o}$ in order to increase the likelihood and then modify $x_t^{u}$ in order to keep it coherent with $x_t^{o}$. A typical example of this kind arises in object tracking problems, where positions and velocities have a special and simple physical relationship but usually only position variables are observed through a linear or nonlinear transformation. In this case, nudging would only affect the position variables. However, using these position variables, one can also nudge the velocity variables with simple rules. We discuss this idea and show numerical results in Section 5.
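The single gradient step described above can be written as a small helper function. The 2-D Gaussian-shaped likelihood and the step size below are made-up illustrations; the only property that matters is that the move does not decrease $g_t$.

```python
import numpy as np

def gradient_nudge(x, grad_g, gamma):
    """One-step gradient nudging: x' = x + gamma * grad g_t(x).
    gamma is an assumed fixed step size (see the discussion of step sizes)."""
    return x + gamma * grad_g(x)

# Example with a 2-D Gaussian-shaped likelihood centred at y (made-up numbers)
y = np.array([1.0, -0.5])
g = lambda x: np.exp(-0.5 * np.sum((y - x) ** 2))
grad_g = lambda x: g(x) * (y - x)  # gradient of the Gaussian bump

x = np.array([0.0, 0.0])
x_new = gradient_nudge(x, grad_g, gamma=0.5)
assert g(x_new) >= g(x)  # the move did not decrease the likelihood
```

For a sufficiently small step size, a gradient move of this form increases the likelihood of the particle, as required by Definition 1.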
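Algorithm 4 is not reproduced here, but a gradient-free nudge in the same spirit can be sketched as follows: propose random perturbations and keep a candidate only if it increases the likelihood, so the move is a valid nudging operation by construction. The likelihood, perturbation scale and number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_search_nudge(x, g, rng, n_tries=10, scale=0.5):
    """Gradient-free nudging sketch: propose Gaussian perturbations and accept
    a candidate only if it increases the likelihood g, so the returned point
    satisfies g(x') >= g(x) by construction."""
    best = x
    for _ in range(n_tries):
        cand = best + scale * rng.normal(size=x.shape)
        if g(cand) > g(best):   # accept only likelihood-increasing moves
            best = cand
    return best

y = 2.0
g = lambda x: float(np.exp(-0.5 * (y - x[0]) ** 2))  # made-up 1-D likelihood
x = np.array([0.0])
x_new = random_search_nudge(x, g, rng)
assert g(x_new) >= g(x)
```

This variant only requires the ability to evaluate $g_t$ pointwise, which is useful when gradients are unavailable or expensive.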

### 3.4 Nudging general particle filters

In this paper we limit our presentation to BPFs in order to focus on the key concepts of nudging and to ease presentation. It should be apparent, however, that nudging steps can be plugged into general PFs. More specifically, since the nudging step is algorithmically detached from the sampling and weighting steps, it can be easily used within any PF, even if it relies on different proposals and different weighting schemes. We leave for future work the investigation of the performance of nudging within widely used PFs, such as auxiliary particle filters (APFs) (Pitt and Shephard, 1999).

## 4 Analysis

The nudging step modifies the random generation of particles in a way that is not compensated by the importance weights. Therefore, we can expect nudging to introduce bias in the resulting estimators in general. However, in Section 4.1 we prove that, as long as some basic guidelines are followed, the estimators of integrals with respect to the filtering measure $\pi_t$ and the predictive measure $\xi_t$ converge in $L_p$ as $N \to \infty$ with the usual Monte Carlo rate, $\mathcal{O}(1/\sqrt{N})$. The analysis is based on a simple induction argument and ensures the consistency of a broad class of estimators. In Section 4.2 we briefly comment on the conditions needed to guarantee that convergence is attained uniformly over time. We do not provide a full proof, but this can be done by extending the classical arguments in Del Moral and Guionnet (2001) or Del Moral (2004) and using the same treatment of the nudging step as in the induction proof of Section 4.1. Finally, in Section 4.3, we provide an interpretation of nudging in a scenario with modelling errors. In particular, we show that the NuPF can be seen as a standard BPF for a modified dynamical model which is “a better fit” for the available data than the original SSM.

### 4.1 Convergence in $L_p$

The goal in this section is to provide theoretical guarantees of convergence for the NuPF under mild assumptions. First, we analyze a general NuPF (with an arbitrary nudging operator $\alpha_t^{y_t}$ and an upper bound on the size $M$ of the index set $\mathcal{I}_t$) and then we provide a result for a NuPF with gradient nudging.

Before proceeding with the analysis, let us note that the NuPF produces several approximate measures, depending on the set of particles (and weights) used to construct them. After the sampling step, we have the random probability measure

$$\xi_t^N = \frac{1}{N} \sum_{i=1}^N \delta_{\bar{x}_t^{(i)}}, \tag{8}$$

which converts into

$$\tilde{\xi}_t^N = \frac{1}{N} \sum_{i=1}^N \delta_{\tilde{x}_t^{(i)}} \tag{9}$$

after nudging. Once the weights are computed, we obtain the approximate filter

$$\tilde{\pi}_t^N = \sum_{i=1}^N w_t^{(i)} \delta_{\tilde{x}_t^{(i)}}, \tag{10}$$

which finally yields

$$\pi_t^N = \frac{1}{N} \sum_{i=1}^N \delta_{x_t^{(i)}} \tag{11}$$

after the resampling step.

Similar to the BPF, the simple Assumption 1 stated next is sufficient for consistency and to obtain explicit error rates (Del Moral and Miclo, 2000; Crisan and Doucet, 2002; Míguez et al, 2013) for the NuPF, as stated in Theorem 1 below.

###### Assumption 1.

The likelihood function $g_t$ is positive and bounded, i.e.,

$$g_t(x_t) > 0 \quad \textrm{and} \quad \|g_t\|_\infty = \sup_{x_t \in \mathsf{X}} |g_t(x_t)| < \infty$$

for $t = 1, 2, \ldots, T$.

###### Theorem 1.

Let $y_{1:T}$ be an arbitrary but fixed sequence of observations, with $T < \infty$, and choose any $M \le \sqrt{N}$ and any map $\alpha_t^{y_t} : \mathsf{X} \to \mathsf{X}$. If Assumption 1 is satisfied, then

$$\left\|(\varphi, \pi_t^N) - (\varphi, \pi_t)\right\|_p \le \frac{c_{t,p} \|\varphi\|_\infty}{\sqrt{N}} \tag{12}$$

for every $\varphi \in B(\mathsf{X})$, any $p \ge 1$, any $t = 1, \ldots, T$ and some constant $c_{t,p} < \infty$ independent of $N$.

See Appendix A for a proof.

Theorem 1 is very general; it actually holds for any map $\alpha_t^{y_t} : \mathsf{X} \to \mathsf{X}$, i.e., not necessarily a nudging operator. We can also obtain error rates for specific choices of the nudging scheme. A simple, yet practically appealing, setup is the combination of batch and gradient nudging, as described in Sections 3.2 and 3.3, respectively.

###### Assumption 2.

The gradient of the likelihood is bounded. In particular, there are constants $G_t < \infty$ such that

$$\|\nabla_x g_t(x)\|_2 \le G_t < \infty$$

for every $x \in \mathsf{X}$ and every $t = 1, 2, \ldots, T$.

###### Lemma 1.

Choose the number of nudged particles, $M$, and a sequence of step-sizes, $\{\gamma_t\}_{t \ge 1}$, in such a way that $\gamma_t M \le \sqrt{N}$ for every $t$. If Assumption 2 holds and $\varphi$ is a Lipschitz test function, then the error introduced by the batch gradient nudging step with $M$ particles can be bounded as

$$\left\|(\varphi, \xi_t^N) - (\varphi, \tilde{\xi}_t^N)\right\|_p \le \frac{L G_t}{\sqrt{N}},$$

where $L$ is the Lipschitz constant of $\varphi$, for every $p \ge 1$.

See Appendix B for a proof.

It is straightforward to apply Lemma 1 to prove convergence of the NuPF with a batch gradient-nudging step. Specifically, we have the following result.

###### Theorem 2.

Let $y_{1:T}$ be an arbitrary but fixed sequence of observations, with $T < \infty$, and choose a sequence of step sizes $\{\gamma_t\}_{t=1}^T$ and an integer $M$ such that

$$\sup_{1 \le t \le T} \gamma_t M \le \sqrt{N}.$$

Let $\pi_t^N$ denote the filter approximation obtained with a NuPF with batch gradient nudging. If Assumptions 1 and 2 are satisfied, then

$$\left\|(\varphi, \pi_t^N) - (\varphi, \pi_t)\right\|_p \le \frac{c_{t,p} \|\varphi\|_\infty}{\sqrt{N}} \tag{13}$$

for every $t = 1, \ldots, T$, any bounded Lipschitz function $\varphi$, some constant $c_{t,p} < \infty$ independent of $N$, and any integer $p \ge 1$.

The proof is straightforward (using the same argument as in the proof of Theorem 1 combined with Lemma 1) and we omit it here. We note that Lemma 1 provides a guideline for the choice of the step sizes $\gamma_t$ and the number of nudged particles $M$: one can select them jointly so that $\gamma_t M \le \sqrt{N}$. Actually, it would be sufficient to set $\gamma_t M \le c\sqrt{N}$ for some constant $c < \infty$ in order to keep the same error rate (albeit with a different constant in the numerator of the bound). Therefore, Lemma 1 provides a heuristic to balance the step size with the number of nudged particles (note that the step sizes $\gamma_t$ may also have to be kept small enough to ensure that $g_t(\tilde{x}_t^{(i)}) \ge g_t(\bar{x}_t^{(i)})$, so that proper nudging, according to Definition 1, is performed). We can increase the number of nudged particles, but in that case we need to shrink the step size accordingly, so as to keep $\gamma_t M \le \sqrt{N}$. Similar results can be obtained using the gradient of the log-likelihood, $\nabla_x \log g_t(x)$, if the likelihood $g_t$ comes from the exponential family of densities.

### 4.2 Uniform convergence

Uniform convergence can be proved for the NuPF under the same standard assumptions as for the conventional BPF; see, e.g., Del Moral and Guionnet (2001); Del Moral (2004). The latter can be summarised as follows (Del Moral, 2004):

• (i) The likelihood function is bounded and bounded away from zero, i.e., g_t is bounded and there is some constant a > 0 such that g_t(x) ≥ a for every x and every t.

• (ii) The kernel τ_t mixes sufficiently well, namely, for any given integer m there is a constant ε > 0 such that

 \inf_{t>0;\,(x,x')\in\mathcal{X}^2}\frac{\tau_{t+m|t}(A|x)}{\tau_{t+m|t}(A|x')} > \varepsilon

for any Borel set A, where τ_{t+m|t} is the composition of the kernels τ_{t+m}, τ_{t+m-1}, …, τ_{t+1}.

When (i) and (ii) above hold, the sequence of optimal filters is stable and it can be proved that

 \sup_{t>0}\left\|(\varphi,\pi_t)-(\varphi,\pi_t^N)\right\|_p \le \frac{c_p}{\sqrt{N}}

for any bounded function φ, where c_p < ∞ is constant with respect to N and t, and π_t^N is the particle approximation produced by either the NuPF (as in Theorem 1 or, provided γ_t M ≤ √N for all t, as in Theorem 2) or the BPF algorithm. We skip a formal proof as, again, it is a straightforward combination of the standard argument by Del Moral (2004) (see also, e.g., Oreshkin and Coates (2011) and Crisan and Miguez (2017)) with the same handling of the nudging operator as in the proofs of Theorem 1 or Lemma 1.

### 4.3 Nudging as a modified dynamical model

We have found in computer simulation experiments that the NuPF is consistently more robust to model errors than the conventional BPF. In order to obtain some analytical insight into this behaviour, in this section we reinterpret the NuPF as a standard BPF for a modified, observation-driven dynamical model, and discuss why this modified model can be expected to be a better fit for the given data than the original SSM. In this way, the NuPF can be seen as an automatic adaptation of the underlying model to the available data.

The dynamic models of interest in stochastic filtering can be defined by a prior measure τ_0(dx_0), the transition kernels τ_t(dx_t|x_{t-1}) and the likelihood functions g_t(x_t), for t ≥ 1. In this section we write the latter as g_t^{y_t}(x_t), in order to emphasize that the likelihood is parametrised by the observation y_t, and we also assume that every g_t^{y_t} is a normalised pdf in y_t for the sake of clarity. Hence, we can formally represent the SSM defined by (1), (2) and (3) as M_0 = {τ_0, τ_t, g_t^{y_t}}.

Now, let us assume y_t to be fixed and construct the alternative dynamical model M_1 = {τ_0, ~τ_t^{y_t}, g_t^{y_t}}, where

 \tilde{\tau}_t^{y_t}(dx_t|x_{t-1}) := (1-\varepsilon_M)\,\tau_t(dx_t|x_{t-1}) + \varepsilon_M\int\delta_{\alpha_t^{y_t}(\bar{x}_t)}(dx_t)\,\tau_t(d\bar{x}_t|x_{t-1}) \qquad (14)

is an observation-driven transition kernel, and the nudging operator α_t^{y_t} is a one-to-one map that depends on the (fixed) observation y_t. We note that the kernel ~τ_t^{y_t} jointly represents the Markov transition induced by the original kernel τ_t followed by an independent nudging transformation (namely, each particle is independently nudged with probability ε_M). As a consequence, the standard BPF for model M_1 coincides exactly with a NuPF for model M_0 with independent nudging and operator α_t^{y_t}. Indeed, according to the definition of ~τ_t^{y_t} in (14), generating a sample ~x_t from ~τ_t^{y_t}(dx_t|x_{t-1}) is a three-step process where

• we first draw x̄_t from τ_t(dx_t|x_{t-1}),

• then generate a sample u from the uniform distribution U(0,1), and

• if u < ε_M then we set ~x_t = α_t^{y_t}(x̄_t), else we set ~x_t = x̄_t.

After sampling, the importance weight for the BPF applied to model M_1 is w_t ∝ g_t^{y_t}(~x_t). This is exactly the same procedure as in the NuPF applied to the original SSM M_0 (see Algorithm 2).
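The three-step sampling procedure above can be sketched as follows; the Gaussian random-walk kernel and the gradient-step nudging operator are illustrative assumptions, not the paper's specific choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def nudge_op(x, y, gamma=0.5):
    # Hypothetical nudging operator: one gradient step on a Gaussian
    # log-likelihood N(y | x, I), i.e. alpha(x) = x + gamma * (y - x).
    return x + gamma * (y - x)

def sample_obs_driven_kernel(x_prev, y, eps, tau_std=1.0):
    """Draw one sample from the observation-driven kernel (14):
    propagate with the original kernel, then nudge with probability eps."""
    x_bar = x_prev + tau_std * rng.normal(size=x_prev.shape)  # step 1: draw from tau_t
    u = rng.uniform()                                         # step 2: uniform draw
    if u < eps:                                               # step 3: nudge or keep
        return nudge_op(x_bar, y)
    return x_bar
```

The importance weight of the resulting sample is then just its likelihood, exactly as in the BPF.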

Intuitively, one can expect that the observation-driven model M_1 is a better fit for the data sequence y_{1:t} than the original model M_0. Within the Bayesian methodology, a common approach to compare two competing probabilistic models (M_0 and M_1 in this case) for a given data set is to evaluate the so-called model evidence (Bernardo and Smith, 1994) for both M_0 and M_1.

###### Definition 2.

The evidence (or likelihood) of a probabilistic model M for a given data set y_{1:t} is the probability density of the data conditional on the model, which we denote as p(y_{1:t}|M).

We say that M_1 is a better fit than M_0 for the data set y_{1:t} when p(y_{1:t}|M_1) > p(y_{1:t}|M_0). Since

 p(y_{1:t}|\mathcal{M}_0)=\int\cdots\int\prod_{l=1}^{t} g_l^{y_l}(x_l)\,\tau_l(dx_l|x_{l-1})\,\tau_0(dx_0),

and

 p(y_{1:t}|\mathcal{M}_1)=\int\cdots\int\prod_{l=1}^{t} g_l^{y_l}(x_l)\,\tilde{\tau}_l^{y_l}(dx_l|x_{l-1})\,\tau_0(dx_0),

the difference between the evidence of M_0 and the evidence of M_1 depends on the difference between the transition kernels τ_t and ~τ_t^{y_t}.

We have empirically observed in several computer experiments that p(y_{1:t}|M_1) > p(y_{1:t}|M_0), and we argue that the observation-driven kernel implicit to the NuPF is the reason why the latter filter is robust to modelling errors in the state equation, compared to standard PFs. This claim is supported by the numerical results in Sections 5.2 and 5.3, which show how the NuPF attains significantly better performance than the standard BPF, the auxiliary PF (Pitt and Shephard, 1999) or the extended Kalman filter (Anderson and Moore, 1979) in scenarios where the filters are built upon a transition kernel different from the one used to generate the actual observations.

While it is hard to show that p(y_{1:t}|M_1) > p(y_{1:t}|M_0) for every NuPF, it is indeed possible to guarantee that the latter inequality holds for specific nudging schemes. An example is provided in Appendix C, where we describe a certain nudging operator and then proceed to prove that p(y_{1:t}|M_1) > p(y_{1:t}|M_0), for that particular scheme, under some regularity conditions on the likelihoods and transition kernels.
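To make the evidence comparison concrete, p(y_{1:t}|M) can be approximated by crude Monte Carlo: simulate state trajectories under the model's dynamics and average the likelihood of the data along them. The one-dimensional toy model below (Gaussian random walk, Gaussian likelihood, degenerate prior at zero) is an assumption for illustration only, not a model from the paper.

```python
import numpy as np

def mc_evidence(ys, sample_kernel, loglik, n_samples=5000, seed=0):
    """Crude Monte Carlo estimate of the log model evidence p(y_{1:t} | M):
    draw state trajectories from the model dynamics and average the
    product of likelihoods along each trajectory."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_samples)              # x_0 = 0 (degenerate prior)
    logw = np.zeros(n_samples)
    for y in ys:
        x = sample_kernel(x, rng)        # propagate under the transition kernel
        logw += loglik(x, y)             # accumulate log-likelihoods
    m = logw.max()                       # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(logw - m)))
```

Running this estimator for an original kernel and for its nudged counterpart gives a direct numerical comparison of the two evidences.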

## 5 Computer simulations

In this section, we present the results of several computer experiments. In the first one, we address the tracking of a linear-Gaussian system. This is a very simple model which enables a clear-cut comparison of the NuPF and other competing schemes, including a conventional PF with optimal importance function (which is intractable for all other examples) and a PF with nudging and proper importance weights. Then, we study three nonlinear tracking problems:

• a stochastic Lorenz 63 model with misspecified parameters,

• a maneuvering target monitored by a network of sensors collecting nonlinear observations corrupted with heavy-tailed noise,

• and, finally, a high-dimensional stochastic Lorenz 96 model.⁴

⁴ For the experiments involving the Lorenz 96 model, simulation from the model is implemented in C++ and integrated into Matlab. The rest of the simulations are fully implemented in Matlab.

We have used gradient nudging in all experiments, with either a prescribed number M of particles nudged deterministically (batch nudging) or each particle nudged independently with some fixed probability (independent nudging). This ensures that the assumptions of Theorem 1 hold. For simplicity, the gradient steps are computed with fixed step sizes, i.e., γ_t = γ for all t. For the object tracking experiment, we have used a large step-size, but this choice does not affect the convergence rate of the NuPF algorithm either.

### 5.1 A high-dimensional, inhomogeneous Linear-Gaussian state-space model

In this experiment we compare different PFs implemented to track a high-dimensional linear Gaussian SSM. In particular, the model under consideration is

 x_0 \sim \mathcal{N}(0, I_{d_x}), \qquad (15)
 x_t|x_{t-1} \sim \mathcal{N}(x_{t-1}, Q), \qquad (16)
 y_t|x_t \sim \mathcal{N}(C_t x_t, R), \qquad (17)

where x_t ∈ ℝ^{d_x} are hidden states, y_t ∈ ℝ^{d_y} are observations, and Q and R are the process and the observation noise covariance matrices, respectively. The latter are diagonal matrices, namely Q = qI_{d_x} and R = rI_{d_y}. The sequence {C_t}_{t≥1} defines a time-varying observation model. The elements of this sequence are chosen as random binary matrices, i.e., C_t ∈ {0,1}^{d_y×d_x}, where each entry is generated as an independent Bernoulli random variable. Once generated, they are fixed and fed into all the algorithms we describe below for each independent Monte Carlo run.
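A minimal sketch of data generation from (15)–(17) follows; the dimensions, the noise variances q and r, and the Bernoulli parameter p used below are placeholder values, since the exact experimental settings are not reproduced here.

```python
import numpy as np

def simulate_lgssm(T, dx, dy, q, r, p, seed=0):
    """Simulate the time-varying linear-Gaussian SSM (15)-(17) with
    random binary observation matrices C_t (independent Bernoulli(p) entries)."""
    rng = np.random.default_rng(seed)
    Cs = rng.binomial(1, p, size=(T, dy, dx)).astype(float)  # fixed once generated
    x = rng.normal(size=dx)                                  # x_0 ~ N(0, I)
    xs, ys = [], []
    for t in range(T):
        x = x + np.sqrt(q) * rng.normal(size=dx)             # x_t ~ N(x_{t-1}, q*I)
        y = Cs[t] @ x + np.sqrt(r) * rng.normal(size=dy)     # y_t ~ N(C_t x_t, r*I)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys), Cs
```

In the experiments, the same realisation of {C_t} is reused across all filters in each Monte Carlo run.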

We compare the NuPF with three alternative PFs. The first method we implement is the PF with the optimal proposal pdf p(x_t|x_{t-1}, y_t), abbreviated as Optimal PF. The pdf p(x_t|x_{t-1}, y_t) leads to an analytically tractable Gaussian density for the model (15)–(17) (Doucet et al., 2000), but not in the nonlinear tracking examples below. Note, however, that at each time step, the mean and covariance matrix of this proposal have to be explicitly evaluated in order to compute the importance weights.
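For the linear-Gaussian model, the mean and covariance of the optimal proposal follow from standard Gaussian conditioning (Doucet et al., 2000); the sketch below assumes the notation of (15)–(17) and is not the paper's implementation.

```python
import numpy as np

def optimal_proposal(x_prev, y, C, Q, R):
    """Mean and covariance of the optimal proposal p(x_t | x_{t-1}, y_t)
    for the model x_t ~ N(x_{t-1}, Q), y_t ~ N(C x_t, R)."""
    S = C @ Q @ C.T + R             # innovation covariance
    K = Q @ C.T @ np.linalg.inv(S)  # gain matrix
    mean = x_prev + K @ (y - C @ x_prev)
    cov = Q - K @ C @ Q
    return mean, cov
```

Each particle is then propagated by drawing from N(mean, cov), and the importance weight requires evaluating the predictive density of y_t, which is Gaussian with covariance S.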

The second filter is a nudged PF with proper importance weights (NuPF-PW). In this case, we treat the generation of the nudged particles as a proposal function to be accounted for during the weighting step. To be specific, the proposal distribution resulting from the NuPF has the form

 \tilde{\tau}_t(dx_t|x_{t-1})=(1-\epsilon_N)\,\tau_t(dx_t|x_{t-1})+\epsilon_N\,\bar{\tau}_t(dx_t|x_{t-1}), \qquad (18)

where ε_N ∈ (0,1) is the probability that a particle is nudged and

 \bar{\tau}_t(dx_t|x_{t-1})=\int\delta_{\alpha_t^{y_t}(\bar{x}_t)}(dx_t)\,\tau_t(d\bar{x}_t|x_{t-1}).

The latter conditional distribution admits an explicit representation as a Gaussian for model (15)–(17) when the operator α_t^{y_t} is designed as a gradient step, but this approach is intractable for the examples in Section 5.2 and Section 5.4. Note that ~τ_t is a mixture of two time-varying Gaussians and this fact adds to the cost of the sampling and weighting steps. Specifically, computing weights for the NuPF-PW is significantly more costly, compared to the BPF or the NuPF, because the mixture (18) has to be evaluated together with the likelihood and the transition pdf.
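In one dimension, and assuming a gradient-step nudging operator α(x) = x + γ(y − x) (an illustrative choice under which the nudged kernel is again Gaussian), the proper NuPF-PW weight can be evaluated as follows; all parameter values are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def nupf_pw_weight(x, x_prev, y, eps, gamma, tau_var, lik_var):
    """Proper importance weight for the nudged proposal (18), in 1-d:
    w = g(x) * tau(x | x_prev) / tau_tilde(x | x_prev).
    The operator alpha(x) = x + gamma*(y - x) maps the Gaussian kernel
    N(x_prev, tau_var) to N((1-gamma)*x_prev + gamma*y, (1-gamma)**2 * tau_var)."""
    tau = gauss_pdf(x, x_prev, tau_var)
    tau_bar = gauss_pdf(x, (1 - gamma) * x_prev + gamma * y,
                        (1 - gamma) ** 2 * tau_var)
    tau_tilde = (1 - eps) * tau + eps * tau_bar   # mixture proposal (18)
    g = gauss_pdf(x, y, lik_var)                  # Gaussian likelihood (assumed)
    return g * tau / tau_tilde
```

The extra cost relative to the NuPF is visible here: the mixture density must be evaluated for every particle, on top of the likelihood and the transition pdf.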

The third tracking algorithm implemented for model (15)–(17) is the conventional BPF.

For all filters, we have used the same number of particles, N.⁵

⁵ When N is increased, the results are similar for the NuPF, the optimal PF and the NuPF-PW, as they already perform close to optimally for this value of N, and only the BPF improves significantly.

In order to implement the NuPF and NuPF-PW schemes, we have selected a fixed step size γ. We have run 1,000 independent Monte Carlo simulations for this experiment. To evaluate the different methods, we have computed the empirical normalised mean squared errors (NMSEs). Specifically, the NMSE for the j-th simulation run is

 \overline{\mathrm{NMSE}}(j)=\frac{\sum_{t=1}^{t_f}\left\|\bar{x}_t-\hat{x}_t^{(j)}\right\|_2^2}{\sum_{t=1}^{t_f}\left\|x_t\right\|_2^2}, \qquad (19)

where x̄_t is the exact posterior mean of the state x_t conditioned on the observations up to time t, and x̂_t^{(j)} is the estimate of the state vector in the j-th simulation run. Therefore, the notation \overline{\mathrm{NMSE}} indicates that the normalised mean squared error is computed with respect to the posterior mean x̄_t. In the figures, we usually plot the mean and the standard deviation of the sample of errors, \overline{\mathrm{NMSE}}(1), …, \overline{\mathrm{NMSE}}(1000).
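The error metric (19) can be computed as follows; the argument names `x_bar` (posterior means), `x_hat` (estimates) and `x_true` (state trajectory) are ours, introduced for the sketch.

```python
import numpy as np

def nmse(x_bar, x_hat, x_true):
    """Normalised mean squared error as in (19): squared error of the
    estimates relative to the posterior means, normalised by the energy
    of the state trajectory. All inputs have shape (t_f, dx)."""
    num = np.sum(np.sum((x_bar - x_hat) ** 2, axis=1))
    den = np.sum(np.sum(x_true ** 2, axis=1))
    return num / den
```

Averaging this quantity over the independent Monte Carlo runs yields the values reported in the figures.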

The results are shown in Fig. 1. In particular, in Fig. 1(a), we observe that the