 # Kernel embedding of maps for sequential Bayesian inference: The variational mapping particle filter

In this work, a novel sequential Monte Carlo filter is introduced which aims at efficient sampling of high-dimensional state spaces with a limited number of particles. Particles are pushed forward from the prior to the posterior density using a sequence of mappings that minimizes the Kullback-Leibler divergence between the posterior and the sequence of intermediate densities. The sequence of mappings represents a gradient flow. A key ingredient of the mappings is that they are embedded in a reproducing kernel Hilbert space, which allows for a practical and efficient algorithm. The embedding provides a direct means to calculate the gradient of the Kullback-Leibler divergence leading to quick convergence using well-known gradient-based stochastic optimization algorithms. Evaluation of the method is conducted in the chaotic Lorenz-63 system, the Lorenz-96 system, which is a coarse prototype of atmospheric dynamics, and an epidemic model that describes cholera dynamics. No resampling is required in the mapping particle filter even for long recursive sequences. The number of effective particles remains close to the total number of particles in all the experiments.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Methodology

### 1.1 Sequential Bayesian estimation

We assume the estimation problem is encompassed of a dynamical model , which predicts the state from a previous state, and an observational model , which transforms the state from the (hidden) state space to the observational space. The set of equations that defines the estimation problem is known as a state-space model or a hidden Markov model. These are

 xk=M(xk−1,\boldmathηk), (1)
 yk=H(xk,\boldmathνk), (2)

where is the state at time , are the observations, is the random model error, and is the observational error. The method developed here is general and does not rely on the additive or Gaussian assumption in model errors.

Observations are assumed to be measured at discrete times. Some apriori knowledge of the state at is assumed, the initial prior state density . The sequential Bayesian state inference is given in two stages:

1. Firstly, in the evolution stage the prediction density is determined as

 p(xk|y1:k−1)=∫p(xk|xk−1)p(xk−1|y1:k−1)dxk−1, (3)

where we denote as . At , so that the prediction density is .

2. Secondly, in the assimilation stage Bayes rule is used to express the inference as a sequential process

 p(xk|y1:k)=p(yk|xk)p(xk|y1:k−1)p(yk|y1:k−1) (4)

where is the observation likelihood and is the marginalized likelihood.

### 1.2 Particle flows and optimal transport

The concept of homotopy can be used to transform the prior density towards the posterior density. Continuous deformations through a parameter can be used to achieve this, such as , . (To simplify the notation we have left the conditioning to observations and time index implicit in this and the following subsection.) The parameter is interpreted as a pseudo-time which is varied from 0 to 1 at a fixed real time. This concept was used in Daum and Huang (2007)

for particle filters. The densities are represented through particles, so that the deformations are represented by particles moving in a flow according to a set of ordinary differential equations,

 dxλdλ=vλ(xλ), (5)

where is the drift or velocity field. We focus on deterministic flows so that diffusivity processes are not considered. In some recent works, the particles are pushed smoothly forward in pseudo-times using a Gaussian approximation to represent the velocity field (e.g. Bunch and Godsill (2016); Li and Coates (2017)).

Another promising approach for the sampling of complex posterior distributions is the optimal transport problem (Marzouk et al, 2017)

. Given the prior probability mass distribution

and the target probability mass distribution , we want to transport the probability mass from to using a mapping . The optimal transport problem seeks the transformation that gives the minimal cost to transport the distribution of mass from to . This represents the classical Monge optimal transport problem. There is a rigorous proof of the existence of such optimal mapping under mild conditions (McCann, 1995). Furthermore, (Angenent et al, 2003) have formulated the Monge-Kantorovich problem as a steepest descent gradient optimization.

Our approach combines the ideas of optimal transport and particle flows, the so-called local approach to optimal transport (Villani, 2008). Through a sequence of mappings we seek to push forward the particles from the prior to the target density. As in particle flows, the sequence of mappings are required to be as smooth as possible. The particles behave as active Lagrangian tracers in a flow. The velocity field at each pseudo-time step of the mapping sequence is chosen following a local optimal transport criterion.

### 1.3 Variational mappings

At each iteration of the sequence, we propose a local transformation that follows (5),

 xλ+ϵ=T(xλ)=xλ+ϵv(xλ), (6)

where is assumed to be small and is an arbitrary vectorial function “sufficiently” smooth, which represents the velocity of the flow defined in (5).

The Kullback-Leibler divergence is used as a measure of the difference between the intermediate density and the posterior density . The Kullback-Leibler divergence between and after the transformation, , in terms of the inverse mapping is

 KLT=∫qX(x)log(qX(x)pT−1(x))dx, (7)

where and is the determinant Jacobian of the mapping and for brevity .

Our goal is to find the velocity field of the transformation that gives the minimum of for each mapping of the sequence. In other words, we need to find the direction , that gives the steepest descent of .

We use the Gateaux derivative of a functional, which is a generalization of the directional derivative. Given the functional , it is defined as

 DhF=limϵ→0F(x+ϵh(x))−F(x)ϵ=dϵF(x+ϵh)|ϵ=0. (8)

The Gateaux derivative of the Kullback-Leibler divergence, (7), in the direction is given by

 DvKL=−∫q(x)dϵlogpT−1(x)|ϵ=0 dx. (9)

The derivative of the transformed log-posterior density is

 dϵlogpT−1(x)=∇Tlogp(T(x))⊤dϵT+Tr(∇xT−1dϵ∇xT)|ϵ=0. (10)

Considering that in (10) and replacing in (9), the directional derivative of results in

 DvKL=−∫q(x)[∇xlogp(x)⊤v(x)+Tr(∇xv)] dx. (11)

Equation 11 gives the derivative along . However, we require the negative gradient of in terms of the samples of for the optimization of as a function of . In general belongs to an infinite dimensional Hilbert space. Hence, the full optimization problem is still intractable in practice since we do not have a way to determine the that gives the steepest descent direction.

One way to limit our functional optimization problem so that it becomes tractable is choosing as space of functions the unit ball of a reproducing kernel Hilbert space (RKHS), which we denote as . This was proposed by Liu and Wang (2016). In this way, we constrain to and require that to find the gradient of . The optimization problem is then to find the that gives the direction of steepest descent of . The main properties of the RKHS are included in the SI Appendix.

Given a vector-valued kernel function,

, any function from can be expressed as the dot product by the kernel that defines the RKHS,

 v(x)=⟨K(⋅,x),v(⋅)⟩F. (12)

Equation 12 is known as the reproducing property (see SI Appendix SI.B). The scalar kernel defines .

Using (12) in (11) and dot product properties, we find:

 DvKL=⟨−∫q(x)[K(x,⋅)∇xlogp(x)+∇xK(x,⋅)] dx,v(⋅)⟩F1. (13)

This is valid for any such that . Therefore, the first term of the dot product in (13) is by definition the gradient of at ,

 ∇KL(x)=−Ex′∼q[K(x′,x)∇xlogp(x′)+∇xK(x′,x)]. (14)

This expression for the gradient, (14), is particularly suitable for Monte-Carlo integration when is only known through a set of samples. Since we seek to minimize the Kullback-Leibler divergence we choose as direction of the transformation the negative of its gradient, the steepest descent direction,

 xλ+ϵ=x−ϵ∇KL(x). (15)

Using Kernel reproducing property and integration by parts in (14), the resulting gradient flow of the mapping is

 vλ(x)=qλ(x)∇logp(x)−∇qλ(x). (16)

In Moselhy and Marzouk (2012), the Kullback-Leibler divergence is also used but with an extra regularization term to find a single global optimal map. Here, we apply a gradient flow via a sequence of local mappings. Under smoothness constrains(Villani, 2008), the minimum of the cost function as a function of the mappings, (6), is uniquely determined in (15), i.e. no regularization term is required. As shown in Tabak and Vanden-Eijnden (2010), the gradient flow, (16), has as stationary solution so that the gradient flow converges toward the target density.

### 1.4 Mapping particle filter

Consider we have a set of equal-weight particles which sample the posterior density at time . The target density at time which we aim to sample using the variational mapping is the posterior density . The mapping approach only requires a set of samples of the prior density. It is started by the set of unweighted particles that are evolved from the previous estimate to the present assimilation time by the dynamical model, i.e. , where the second subscript represents the mapping iteration. These samples are then pushed towards the sequential posterior density by the mapping iterations.

Given the set of particles that are samples of the intermediate density at mapping iteration , the gradient of the Kullback-Leibler divergence from (14) at a state space position by Monte-Carlo integration is

 ∇KL(x)=−1NpNp∑l=1 [K(x(l)k,i−1,x)∇logp(x(l)k,i−1) +∇xK(x(l)k,i−1,x)]. (17)

The negative of this gradient represents the velocity field of the gradient flow. An interpretation of the two RHS terms in (17) is given in SI Appendix (SI.C). At the mapping iteration , the particle is transformed, according to the mapping (15), by

 x(j)k,i=x(j)k,i−1−ϵ∇KL(x(j)k,i−1), (18)

where is the steepest descent direction at the particle position. (18) may be interpreted as the movement of the particles along the streamlines of the flow assuming small .

An ingredient needed to evaluate (17) is an expression for the gradient of the posterior density. A problem in sequential Bayesian inference is that there is no exact expression for the posterior density. We do know the likelihood function, but we only have a set of particles that represent the prior density, not the density itself. The prior density using the particle representation of the posterior density at time in (3) results in

 p(xk|y1:k−1)≈1NpNp∑jp(xk|x(j)k−1), (19)

and the expression for the posterior density from (4) becomes

 p(xk|y1:k)∝1Npp(yk|xk)Np∑jp(xk|x(j)k−1). (20)

Because of the finite ensemble size at time , this expression is an approximation of the posterior density. Equation 20 is the target density in the mapping.

At each mapping iteration, all the particles are moved following (18). Successive applications of the transformation, through gradient descent, with the corresponding updates of , will converge towards the minimum of the Kullback-Leibler divergence. The pseudo-code of the implemented algorithm which combines variational mapping with the sequential particle filter is shown in Algorithm 1.

The form in which the variational mapping is obtained, as steepest gradient descent, is suitable for the stochastic optimization algorithms used in the machine learning community. In this case,

, known as the learning rate, can be determined adaptively using from previous iterations (e.g. Zeiler (2012)).

An important part of any iterative variational method is a stopping criterion. One possibility is to check the value of

for each particle, or averaged over all particles. Another option is to use importance sampling and to interpret the final density of the mapping sequence as a proposal density and calculate weights with respect to the posterior density, which will then automatically lead to an unbiased estimator. To use this, an expression for the final proposal density

is required, however, we have only its particle representation. Two possibilities exist to implement this in a practical way. If the dimension of the system is small one can use kernel-density estimation based on the particles. The other approach is to use the transformations, that we do have explicitly, and relate the final density to the prior density. The approaches are detailed in the SI Appendix (SI.D).

To avoid the resampling in the importance sampling step, one can calculate the effective ensemble size as , and decide to continue the iterations in Algorithm 1 until reaches a certain threshold . In the following assimilation cycle, the weights of the particles representing the final proposal density has to be considered in the sequential posterior density, resulting in a weighted posterior density instead of (20). One should keep in mind, however, that our estimate of the posterior density is rather poor in high-dimensional systems with a small number of particles, so the weights would not be very accurate.

## 2 Numerical experiments

The implementation of the mapping particle filter is evaluated using three chaotic nonlinear dynamical models with state spaces of up to 40 dimensions. Higher dimensional state spaces will be pursued in a followup work. We conducted a set of stochastic twin experiments. In these experiments, the dynamical model to produce the synthetic observations is the same as the one used in the state-space model, including the statistical parameters.

A Gaussian kernel is chosen for the experiments,

 K(x,x′)=E[−12(x−x′)⊤A−1(x−x′)]. (21)

As a first reasonable choice, is taken proportional to the model error covariance matrix, , i.e. . This appears to be a convenient choice since the model uncertainty, , already includes the physics so that it is expected to represent the scaling between the variables and also the correlations between variables. Under this choice the only parameter of the kernel that requires to be defined is . With growing dimension of the state vector and the number of particles small, the distance between the particles grows larger. To account for this, should be chosen larger to increase the distance between particles. In the experiments, we have found that should be chosen of the order of the dimension of the state vector. Further details on the experiments are given in SI Appendix. Figure 1: Marginalized prior density determined at the 115-th time iteration, k=115 (blue line), the final density after the variational mapping (red line) and the marginalized sequential posterior “target” distribution (black). Figure 2: Marginalized posterior density (contours) and the particles sampling the prior density at the 115-th time iteration for the three sections at the mode location (upper panels). The particles sampling the final density after the variational mapping with 50 iterations (lower panels).

A quick convergence of the prior to the posterior density in terms of the mapping iterations is the key ingredient for high-dimensional applications. The applications we have in mind satisfy

. Therefore, we evaluated, apart from stochastic gradient descent with a fixed learning rate

, adadelta (Zeiler, 2012) and adam (Kingma and Ba, 2015) which have an adaptive learning rate for each state variable (see SI Appendix, SI.F). These stochastic optimization methods are particularly suitable for high-dimensional spaces.

The maximum number of mapping iterations is fixed to , and a threshold for the number of effective particles is used as criterion of convergence, . As mentioned, a threshold in the absolute value of the gradient could also be used as criterion of convergence. In SI Appendix, SI.F we compare these two criteria for convergence of the variational mapping.

## 3 Results

Figure 1 shows the marginalized prior density as a function of the three state variables of the Lorenz-63 system determined at the 115-th time assimilation cycle, (each panel exhibits a variable). For plotting, the marginalized density is determined with kernel density estimation. The cycle was chosen so that the differences between the mapping particle filter (MPF) and the sampling importance resampling (SIR) filter are emphasized (i.e. a cycle for which the apriori density is not a good representation of the posterior density). The experiment uses 100 particles and the full state is observed. Because particles are spread in and out the high observation likelihood region, the proposal density is broader, asymmetric compared with the marginalized sequential posterior density. After the variational mapping, the resulting density is a close representation of the sequential posterior density, the target density for the mapping, in the three variables (see the three panels in Fig. 1).

The distribution of the particles can be seen in Fig. 2. Upper panels of Fig. 2 show the particles belonging to the prior density for each variable, the initial density for the mapping. Contours show the posterior density, marginalised to the 2-dimensional plane. Lower panels show the final locations of the samples after 50 mapping iterations. The mapping has pushed most particles towards the high probability region of the posterior density. Note that the algorithm is not collapsing the particles toward the mode of the posterior, but representing the density with the samples as a whole. The diversity of the particles produced by the mapping is notable. Figure 3: As in Fig. 2 for the SIR particle filter.

The same experiment conducted with the standard SIR, or bootstrap filter using resampling when is shown in Fig. 3. The sampling deficiencies in the SIR filter are visible in Fig. 3 where the particles of the prior density (upper panels) and the resampled density (lower panels) are shown with dots. Because the resampling conserves and replicates statistically only the particles with high likelihood, there is an evident sample impoverishment even in this low-dimensional twin experiment. Figure 4: RMSE for the MPF and the SIR filters as a function of the number of particles. Panels (a) 100, (b) 20, and (c) 5 particles.

Because of the good sampling, the MPF exhibits a very weak sensitivity to the number of particles in the root-mean-square error (RMSE) measure. Figure 4 shows the RMSE of the analysis mean with respect to the true state, for MPF and SIR filters for 100, 20 and 5 particles (Fig. 4 a, b and c respectively). Even for a large number of particles (with respect to the state dimension), the MPF outperforms the mean estimation with respect to the SIR filter. SIR filter diverges for 5 particles. Figure 4 a, b and c shows that the time-mean RMSE for the MPF practically does not change between 100 particles with a time-mean RMSE of 0.482 to 5 particles whose time-mean RMSE is 0.489. Thus, the performance of the MPF is quite robust, only small changes in the RMSE are found when decreasing the number of particles.

Note the complexity of Algorithm 1 is proportional to while it is proportional to for the SIR filter, so that the computational cost of the assimilation stage is higher for the MPF with the same number of particles. However, a smaller number of particles is required so that it represents a promising venue for expensive dynamical models.

To evaluate the performance of the MPF for a larger state-space model, an experiment using the 40-variable Lorenz-96 dynamical system was conducted. An ensemble of 20 particles was used in the experiment. Scalable sampling methods are expected to produce useful results for these experiments in which . Figure 5a compares the resulting RMSE of the state given by MPF and the stochastic EnKF without localization as a function of time. The SIR filter is not shown since it exhibits degeneracy in this experiment because of the high dimensional state and observation spaces. The MPF takes a longer time to converge (20 cycles), but the EnKF exhibits a larger RMSE (this result might be improved using more members or localization, although the Gaussian assumption will hit the EnKF results). Furthermore, the mapping particle filter estimates are rather stable in time, which is a result of its deterministic mapping in which the particles feel each other. The spread is rather stable and comparable to the RMSE in both filters (Fig. 5b). Figure 5: RMSE (a) and spread (b) of the state as a function of the assimilation cycles produced by EnKF and MPF in the 40-variables Lorenz-96 system.

A third set of experiments was conducted using a compartment stochastic dynamical model for cholera dynamics (Ionides et al, 2006). This is a 5-variable system allowing for susceptible, infected, and three classes of recovery individuals in which cholera mortality is the only observed variable. From the inference point of view, it presents some challenges(Ionides et al, 2006). It has partial noisy observations whose variance depends on time. Transmission is assumed stochastic and has a multiplicative model error. Since our implementation of the MPF assumes additive model error, we have augmented the state space with a new variable with an evolution equation equal to the noise factor and in this way the augmented cholera model can be represented by additive model error terms. The ensemble is composed of 100 particles and a small artificial additive model error term was added to benefit particle diversity (Liu and West, 2001). Note that a non-Gaussian density for additive model error would give more realistic features. Figure 6a shows the true mortality time series and the one estimated by the MPF. The experiment started with an initial mean state which underestimates the number of true susceptible individuals by 20% . A good performance is found for the MPF (total RMSE is 0.0031). The SIR filter follows the cholera outbreaks but the estimate is much more noisier when the same level of noise is used as for the MPF (total RMSE is 0.0039). By decreasing the level of noise in the SIR filter, the noise diminishes but at the expense of a strong underestimation of the peaks (total RMSE is 0.0077). Figure 6: Mortality time series for a model of cholera dynamics, true mortality and the one estimated with the MPF (a), with the SIR filter with high additive noise (b) and with the SIR filter with low additive noise.

## 4 Discussion

The proposed variational mapping particle filter is based on a deterministic gradient flow. It is able to keep a large effective sample size, avoiding the resampling step, even for long recursions subject to the convergence of the mapping sequences. Furthermore, it shows a robust behavior in terms of the RMSE with a consistent ensemble spread in the conducted experiments, using only a small number of particles.

In the limit of a single particle, , the gradient of the Kullback-Leibler divergence is equal to the negative of the gradient of the log-posterior density (see (17)). In that case, the optimization via gradient descent determines the mode of the posterior density. Therefore, the method for is equivalent to three-dimensional variational data assimilation (3D-Var) with in the role of .

The mapping of samples in the variational mapping particle filter is produced by moving the particles where they maximize the information gain of the proposal density with respect to the sequential posterior density, via the Kullback-Leibler divergence. In these terms, the proposed mapping particle filter seeks to maximize the amount of information available in a complex high-dimensional posterior density using a stochastic optimization method and given a limited number of particles.

Acknowledgements. We are grateful to the reviewers for valuable comments. This work has been founded by CUNDA project of the European Research Council. A python library with the mapping particle filter is available at http://github.com/pulidom/mpf/.

## References

• Angenent et al (2003) Angenent, S., Haker, S. and Tannenbaum, A. (2003) Minimizing flows for the Monge–Kantorovich problem. SIAM journal on mathematical analysis, 35, 61-97.
• Arulampalam et al (2002) Arulampalam, M.S., Maskell, S., Gordon, N. and Clapp, T. (2002) A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on signal processing, 50, 174-188.
• Atkins et al (2013) Atkins, E., Morzfeld, M. and Chorin, A.J., (2013) Implicit particle methods and their connection with variational data assimilation. Mon. Wea. Rev., 141, 1786-1803.
• Bunch and Godsill (2016) Bunch, P. and Godsill, S., (2016) Approximations of the optimal importance density using Gaussian particle flow importance sampling. Journal of the American Statistical Association, 111, 748-762.
• Cappe et al (2005) Cappé O, Moulines E, Rydon T. (2005) Inference in Hidden Markov models. Springer Science+Business Media: New York, NY.
• Chen and Reich (2015) Cheng, Y. and Reich, S., (2015) Assimilating data into scientific models: An optimal coupling perspective. In Nonlinear Data Assimilation, 75-118. Springer.
• Chorin and Tu (2009) Chorin, A.J. and Tu, X., (2009) Implicit sampling for particle filters. Proc. Nat. Acad. Sci. USA, 106, 17249-17254.
• Daum and Huang (2007)

Daum, F. and Huang, J., (2007) Nonlinear filters with log-homotopy.

In Signal and Data Processing of Small Targets 2007. 6699, p. 669918.
• Doucet et al (2000) Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and computing, 10, 197-208.
• Ionides et al (2006) Ionides, E.L., Bretó, C. and King, A.A. (2006) Inference for nonlinear dynamical systems. Proc. Nat. Acad. Sci. USA, 103, 18438-18443.
• Kingma and Ba (2015) Kingma, D. and Ba, J. (2015) Adam: A method for stochastic optimization. In Int. Conf. on Learning Repres. (ICLR) arXiv preprint arXiv:1412.6980.
• LeCun et al (2015)

LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning.

Nature, 521, 436-444.
• Li and Coates (2017) Li, Y. and Coates, M., (2017) Particle filtering with invertible particle flow. IEEE Transactions on Signal Processing, 65, 4102-4116.
• Liu and Wang (2016) Liu, Q. and Wang, D. (2016) Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, 2378-2386.
• Liu and West (2001) Liu, J. and West, M., (2001) Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo methods in practice, 197-223. Springer, New York.
• Marzouk et al (2017) Marzouk Y, T Moselhy, M Parno, A Spantini (2017) An introduction to sampling via measure transport. To appear in Handbook of Uncertainty Quantification; R. Ghanem, D. Higdon, and H. Owhadi, editors; Springer. arXiv:1602.05023
• McCann (1995) McCann, R. (1995) Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal 80, 309–323.
• Moselhy and Marzouk (2012) Moselhy TA, Marzouk YM (2012). Bayesian inference with optimal maps. J. Comp. Phys.. 231, 7815-7850.
• Pulido and Rosso (2017) Pulido M. and O. Rosso (2017) Model selection: Using information measures from ordinal symbolic analysis to select model sub-grid scale parameterizations. J. Atmos. Sci., 74, 3253–3269 doi: 10.1175/JAS-D-16-0340.1
• Pulido et al. (2018) Pulido M., P. Tandeo, M. Bocquet, A. Carrassi and M. Lucini, (2018) Parameter estimation in stochastic multi-scale dynamical systems using maximum likelihood methods. Tellus, 70, 1442099.
• Reich (2013) Reich, S. (2013) A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, 35, A2013-A2024.
• Scholkopf and Smola (2002) Scholkopf, B. and Smola, A.J. (2002)

Learning with kernels: support vector machines, regularization, optimization, and beyond.

MIT press.
• Tabak and Vanden-Eijnden (2010) Tabak, E. G. and Vanden-Eijnden, E. (2010) Density estimation by dual ascent of the log-likelihood. Comm. Math. Sci., 8, 217-233.
• vanLeeuwen (2010) vanLeeuwen P.J. (2010) Nonlinear data assimilation in geosciences: an extremely efficient particle filter. Quart. J. Roy. Met Soc.,136, 1991-1999.
• vanLeeuwen (2015) vanLeeuwen, P.J. (2015) Nonlinear Data Assimilation for high-dimensional systems. In Nonlinear Data Assimilation, 1-73. Springer.
• Villani (2008) Villani, C. (2008) Optimal transport: old and new. Vol. 338. Springer Science & Business Media.
• Zhu et al (2016) Zhu, M., Van Leeuwen, P.J. and Amezcua, J. (2016) Implicit equal weights particle filter. Quart. J. Roy. Met. Soc., 142, 1904-1919.

## Si Supporting Information

### si.a Hidden Markov models with Gaussian errors

Next, we derive the expression of the posterior density in the sequential framework as a function of the particles at . For the numerical experiments, model and observational errors are assumed Gaussian. These are taken as an example, but the technique framework is general. In particular, it is suitable for non-additive and non-Gaussian errors. The prediction probability density can be obtained analytically under the additive Gaussian model error assumption, in (1). The evolution of the particles from (3) gives

 p(xk|y1:k−1)∝Np∑m=1E[−12∥xk−M(x(m)k−1)∥2Qk], (S1)

where . We assume equal-weighted particles here; it is straightforward to assume weighted particles at time if so desired.

The observational error is also assumed additive Gaussian, in (2), so that the observational likelihood at is

 (S2)

The Bayes rule, (4), is used to obtain the sequential posterior density combining the prediction density, (S1), and the observation likelihood, (S2). The sequential posterior density given the particles at the previous time step is proportional to

 p(xk|y1:k)∝Np∑m=1E[−12∥∥xk−M(x(m)k−1)∥∥2Qk]⋅E[−12∥∥yk−H(xk)∥∥2Rk]. (S3)

This is the target density that the particles as Monte Carlo samples should be representing in the filter.

The gradient of the logarithm of the sequential posterior density, (S3), at is given by

 ∇xlogp(x(l)k,i−1)=Cp−1(x(l)k,i−1)Np∑m=1ψQl,mψRl{H⊤R−1k[yk−H(x(l)k,i−1)]−Q−1k[x(l)k,i−1−M(x(m)k−1)]} (S4)

with ,
and . Reducing we obtain

 ∇xlogp(x(l)k,i−1)=H⊤R−1k(yk−H(x(l)k,i−1))−Q−1k⎡⎢⎣x(l)k,i−1−∑Npm=1ψQl,mM(x(m)k−1)∑Npm=1ψQl,m⎤⎥⎦. (S5)

This gradient of the log-posterior, (S5), does not depend on the unknown normalization constant of the sequential posterior density. Hence, the numerical evaluation of the gradient of the Kullback-Leibler divergence is feasible.

### si.b Reproducing kernel Hilbert space

Kernel functions are used in machine learning for finding regressions in datasets using an implicit feature space (Scholkopf and Smola, 2002). Instead of determining the data representation in that space, only the inner products of the images of two data points in the feature space are required. Then, the inner products in the image space are replaced by kernels in the feature space, the so-called kernel trick.

Some of the properties of the kernel are:

1. Let be a positive definite kernel. Mercer’s theorem says that the spectral decomposition

 K(x,x′)=∑λjej(x)ej(x′) (S6)

is given by a uniformly convergent series of the eigenfunctions

. The eigenfunctions satisfy

and the eigenvalues are ordered, the larger the first.

2. Let be a RKHS, and , two functions of such that , . The inner product of the functions in is given by

 ⟨f,g⟩F1=∑jfjgj/λj. (S7)
3. satisfies the reproducing property,

 f(x)=⟨f,K(⋅,x)⟩F1,K(x,x′)=⟨K(x,⋅),K(x′,⋅)⟩F1. (S8)
4. Every positive definite defines a unique RKHS.

If where and , then the inner product in is .

### si.c Interpretation of the gradient of the Kullback-Leibler divergence

The first term of the gradient of the Kullback-Leibler divergence, (17), tends to drive the particles towards the peaks of the posterior density. It gives a weighted average of including the current particle and surrounding ones within the kernel scale of the current one. This is the typical behavior of a variational importance sampler so that it accumulates particles at the high probability regions of the posterior density. On the other hand, the second term of in (17) tends to separate the particles. If the current particle is in the influence region of another particle say , i.e. within the kernel scale, the term will tend to separate them acting as a repulsive force between particles.

As an example, suppose the radial basis functions as kernels,

, the gradient of the kernel is

 ∇K(x,x′)=−1h(x−x′)K(x,x′). (S9)

Suppose particle is into consideration. Particles that are far away, , do not contribute to the two terms in (17). On the other hand, if there is an -th particle that is within the kernel scale of , the -th particle will be driven by a force from the second term in (17) given by,

 x(j)k,i=x(j)k,i−1+γ(x(j)k,i−1−x(l)k,i−1), (S10)

where so that the force given by this term is repulsive. It tends to separate the particles that are within the kernel scale.

### si.d Importance sampling and the variational mapping particle filter

The variational mapping sampling scheme is suitable to be coupled with importance sampling. The last intermediate density obtained with the variational mapping sampling scheme can be used as a proposal density of the posterior density. Suppose we denote as to the particles obtained in the last mapping iteration. The values of the proposal and posterior densities at the particle positions are used to estimate the weights

 ~w(j)k=p(yk|x(j)k)p(x(j)k|y1:k−1)q(x(j)k), (S11)

and then they are normalized with

Since the output of the filter is a set of weighted samples, the sequential posterior density (e.g. (S3)) has to include the weights of the previous cycle,

 p(x(j)k|y1:k) ∝p(yk|x(j)k)∑lw(l)k−1p(x(j)k|x(l)k−1). (S12)

To compute the weights between the posterior density and the proposal density –final density of the MPF–, we also need to evaluate the proposal density at the particle locations. We require the proposal density to be known (apart from the set of samples that we have from that density). There are two ways to obtain it. One way is to use kernel density estimation with the set of samples, , using the same kernels as in the mapping. The second option is to determine at each mapping iteration how the density is transformed with the mapping (e.g. Moselhy and Marzouk (2012)).

Before the mapping, we choose as initial intermediate density, for instance, an equally weighted Gaussian mixture centered at and with covariances ,

 q(xk,0|y1:k) (S13)

In the following mapping iterations , the density changes according to the mapping ,

 qT(x(j)k,i)=q(x(j)k,i−1)det∇xT=q(x(j)k,i−1)|I−ϵH|, (S14)

where is the determinant Jacobian of the transformation and is the Hessian of the Kullback-Leibler divergence. From (17), the Hessian is given by

 H = 1NpNp∑l=1{∇x(j)k,i−1K(x(l)k,i−1,x(j)k,i−1)[∇xlogp(x(l)k,i−1)]⊤ (S15) + ∇x(j)k,i−1∇x(l)k,i−1K(x(l)k,i−1,x(j)k,i−1)}.

The calculation of the Hessian of the Kullback-Leibler divergence, (S15) is computationally demanding. Although note that the second derivatives are well known for standard kernel functions and the symmetry of the kernel can be used to reduce the calculations. However, the Hessian calculation is required at each mapping iteration to evolve from the previous intermediate density .

The weights account for the bias introduced in the filter when the set of samples obtained by the sequence of mappings is not exactly distributed according to the target density. However, it should be kept in mind that our estimate of the posterior density (S12) is not exact, so it is unclear what the calculated weights actually mean. It would be interesting to determine the accuracy of this estimate, but that is beyond the scope of this work.

The weights are also useful as a diagnostic of the quality of the convergence of the optimization method. In that case, an evaluation of the weights for the current intermediate density is conducted in each mapping iteration. If the given weights give an effective sample size which is over the threshold no further mapping iterations are required. In the conducted experiments of this work, in general, the effective sample size was over 98% after 50 mapping iterations during the whole recursion. In other words, if the number of mapping iterations is large enough , the weights will be almost equal. In high-dimensional systems, a convergence criterion based on the module of the gradient of Kulback-Leibler divergence can be used. This avoids the calculations of the Hessian or kernel density estimations. The convergence of the technique is evaluated through experiments with both measures in Section SI.F.

### si.e Experiment details

The Lorenz-63 system is given by

 dxdt=σ(y−x), dydt=x(ρ−z)−y, (S16) dzdt=xy−βz.

The equations are solved using a fourth-order Runge-Kutta scheme. The parameters are chosen at their standard values , and . The integration time step is and the assimilation cycles are every (10 integration time steps). The observation operator

is taken as the identity matrix, for the full-observed state experiments in Section

3. Further experiments are conducted in Section SI.H using partial observations. The model error covariance is chosen diagonal. Its diagonal elements are set to 30% of the climatological variance of the variable, e.g. , where is the time-series variance of variable . The observation error covariance is . The Lorenz-63 experiments use importance sampling to evaluate the bias introduced by the sampling. The weights are evaluated using kernel density estimation with radial basis functions in the state space in coherence to the ones used for the mapping (same Gram matrix).

For the 40-variable Lorenz-96 system experiments, the set of equations is

 dxidt=(xi+1−xi−2)xi−1−xi+F, (S17)

where and cyclic boundary conditions are imposed, , and . The equations were integrated using a fourth-order Runge–Kutta scheme, with an integration step of 0.001. The forcing is which results in chaotic dynamics. The observational time resolution is . The observational error covariance is . The model error covariance is . This amplitude of the model error noise is representative for the two-scale Lorenz 96 system which was estimated to be 0.3 using information measures by Pulido and Rosso (2017)

and 0.3-05 using an Expectation-Maximization algorithm coupled to an ETKF

Pulido et al. (2018). The ensemble is composed of 20 particles. The initial ensemble particles are states taken randomly from a climatology. Because of the large dimension, we do not use the effective number of particles via kernel density estimation as in the Lorenz-63 experiments to evaluate convergence of the optimization, but the module of the gradient of the Kullback-Leibler divergence.

The compartment stochastic dynamical model for cholera is the one throughly described in (Ionides et al, 2006)

. A short description is given here. The individuals are classified as susceptible (

), infected individuals (), while recovered individuals belong to three classes () to allow for different inmune periods. The equations for the number of individuals for each category are given by

 dSt = dNBSt−dNNIt−dNSDt+dNRkSt dIt = dNSIt−dNIR1t−dNICt+dNIDt dR1t = dNIR1t−dNR1R2t−dNR1Dt (S18) dR2t = dNR1R2t−dNR2St−dNR2Dt dR3t = dNR2R3t−dNR3St−dNR3Dt

Transmission is given by a stochastic differential equation

 dNSIt=λtStdt+ϵItSt/PtdWt, (S19)

where

is a Gaussian white noise. The transitions between categories are given by

 dNIR1t=γItdt, dNRl−1Rlt=rkRl−1tdt, dNRlSt=rkRktdt, dNSDt=mStdt, dNIDt=mItdt, dNRlDt=mRltdt, dNICt=mcItdt, dNBSt=dPt+mPtdt,

where , represents birth, is cholera mortality and denotes death from other causes. Equations S18 are integrated using an Euler-Maruyama scheme with time step of  month. The observations are cholera mortality data which are given by

 yt∼N[NICt−NICt−1,τ2(NICt−NICt−1)2]. (S21)

While the mapping particle filter can handle non-additive model errors we decided to transform the system of equations above to allow for an additive model error implementation. We use an augmented state space defining a new state variable as , so that the augmented system has additive model error whose the gradient of the log-posterior density is represented by (S3). To add diversity to the particles in the filter a small additive noise term is added in all the equations. The variance of this additive noise is set to 10% of the variance. This noise is not used to generate the true trajectory, it is only included in the particle filter.

### si.f The stochastic optimization methods and their convergence

An optimization of the Kullback-Leibler divergence is conducted in the mapping particle filter to push the prior particles to samples from the posterior density. The particles are driven in a purely deterministic transformation by the variational mapping from the prior density to the posterior. The intermediate densities are represented through the particles. Each particle is then moved along the direction of the steepest descent of the Kullback-Leibler divergence, which considers all the particle positions.

Several stochastic gradient-based optimization methods have been recently developed in the machine learning community (Zeiler, 2012; Kingma and Ba, 2015) which are directly applicable to the mapping particle filter. They are focused on efficient convergence in high-dimensional control spaces when the cost function is noisy. The source of this noise is subsampling. These optimization methods, and their succesful convergence rates, have been instrumental for the success of deep learning applications (e.g. LeCun et al 2015). The variational mapping in this work is based on stochastic gradient-based optimization methods.

The stochastic gradient descent method has a fixed learning rate for all the iterations and all the control variables. This is well-known to cause some inconveniences, in particular along extended valleys where zig-zag convergence arises. As the gradient decreases in some directions, the convergence along those variables is very slow. The adaptative schemes search for a dynamic learning rate for each control variable. Ideally, a Newton method would be the optimal learning rate if the optimization problem is quadratic, but it requires Hessian evaluations which are computationally prohibitive in high-dimensional spaces. The learning rate for each control variable in the stochastic optimization method, adadelta (Zeiler, 2012), is based in a global learning rate over a running average of the past absolute directional derivatives along the direction of the control variable. Therefore, the learning rate increases for directions with small directional derivatives. The method improves substantially but still some stagnation after the first iterations has been reported. To overcome this weakness, the algorithm for stochastic optimization adam(Kingma and Ba, 2015)

estimates the first and second moments of the gradients with bias correction. These two stochastic optimization algorithms, adadelta and adam, together with a descent gradient method were evaluated in the experiments. The chosen initial learning rate of the optimization methods is a tradeoff between convergence speed and smoothness flow assumptions. The recommended values exhibited a good performance, so that no manual tuning of the learning rate was required

(Zeiler, 2012). Figure S1: The root-mean-square of the KL gradient as a function of the mapping iteration for the descent gradient (blue line), adadelta (red line) and adam (green line) optimization algorithms (upper panel). This corresponds to the first assimilation cycle in which the three optimization algorithms share the same prior density. The number of effective particles as a function of the mapping iteration for Np=100 (lower panel). Figure S2: As Fig. S1 but for the 10-th time iteration, so that the prior density for the optimization algorithms is not the same.

Figure S2 shows the convergence at the 10th assimilation cycle of the optimization methods for evaluating the recursive effects. Notably, adadelta has the fastest convergence with the minimum root-mean-square gradient similar to the one obtained in the first iteration. On the other hand, adam converges slowly in the first mapping iterations and starts to increase the convergence rate after 20 iterations (an effect that is likely caused by the momentum equations in adam which require some iterations to “warm up”). Adadelta has reached the maximum number of effective particles in 50 iterations. Because of the fast convergence and because the effective particle numbers is the most relevant parameter for the particle filter we took adadelta with 50 iterations and an initial learning rate of 0.03 as the default setting for the experiments.

There is a clear correlation between the decrease of the root-mean-square of the KL gradient and the increase in the number of effective particles in Fig. S2 for the three stochastic optimization methods. This implies as mentioned that both measures could be used to determine the convergence in the optimization and the required number of iterations for a given threshold. Along this line, we note that the Kullback-Leibler divergence can be directly expressed in terms of the weights.

Using Monte Carlo integration of the Kullback-Leibler divergence,

 KL(q∥p)=1NpNp∑j=1log⎡⎣q(x(j)k)p(x(j)k|y1:k)⎤⎦, (S22)

from (S11), it results

 KL(q∥p)=−1NpNp∑j=1log(w(j)kNp). (S23)

As expected if the weigths are equally distributed in the particles, the Kullback-Leibler divergence is zero, then . In other words, as the number of effective particles tends to , the Kullback-Leibler divergence tends to zero, its minimum.

### si.g Experiments on sensibility of the MPF

The performance of the mapping particle filter can be evaluated through the weights of the particles. As a function of the mapping iterations of the filter, the sequence of mappings is expected to start with a set of particles with most of the particles with almost zero weights and a few particles with very large weights. Note from (S23) that particles with almost zero weights produce a large Kullback-Leibler divergence. Then, as the particles are pushed toward the posterior density, their weights will tend to be equally distributed for most of the particles. If the variational mappings work effectively, only a few particles should remain with low weights. In other words, the variance of the weights should be small for all the cycles. Figure S3a shows the evolution of the weights of all the particles as a function of time for and without threshold for the Lorenz-63 system. Most of the particle weights are concentrated around 0.05, between 0.04 and 0.07 ( particles). There are some cycles in which one particle obtains a low weight but in the following cycle the weight of that particle is again within the equally-weighted range. Note that we did not perform any resampling. The variance of the weights is around (Figure S3b). The number of effective particles is in general about 19 (from a total of 20 particles), but in a few cycles it can descend down to 16. A further experiment was conducted in which the maximum number of iterations was increased to 100. In that case, all the particles remain with weights larger than 0.04 during the whole time sequence and the peaks of variance of the weights found in Fig. S3b are not present (see Fig. S4). The number of effective particles is in this case over 18 for all the cycles Figure S3: (a) Evolution of the weights of all the particles for the Lorenz-63 system in a long time sequence for I=50 and without threshold. Because the weights of all the particles are plotted, the last particle plotted is the most visible one. (b) The variance of the weights as a function of time. (c) Number of effective particles. Figure S4: Idem Fig. S3 for I=100 and without threshold.

To evaluate the MPF without using importance sampling, we conducted a Lorenz-63 experiment without using weights in the particles after the mapping. As criterion of convergence we use averaged over all the particles. The MPF experiment uses 20 particles and is compared with a SIR filter using 20 particles and another with 100,000 particles. The last is expected to be close to the true posterior density. Figure S5 shows the marginalized posterior densities of the three experiments in each Lorenz-63 variable. Upper panels correspond to cycle 450th and lower panels to cycle 500th. The marginalized densities given by the MPF with 20 particles are remarkably close to the SIR experiment with 100,000 particles after a long sequence of cycles, without using resampling nor importance sampling. On the other hand, the SIR filter with 20 particles has large differences with the true posterior density not only in the mode but also the uncertainties are larger than the true marginalized posterior density in some directions and smaller in other ones. Because of the resampling step some tendency of the SIR with a small number of particles to generate a bimodal marginalized densities in some variables was also found. Figure S5: Marginalized posterior densities in each variable of the Lorenz-63 for a SIR with 100,000 particles (red line), SIR with 20 particles (green line) and MPF with 20 equally weighted particles (blue line) at the 450th cycle (upper panels) and 500th cycle (lower panels).

The radial basis functions used as kernel require to set the bandwidth, which in this work was chosen as . The parameter depends on the number of particles and the dimensions of the problem and requires some tuning. The optimal parameter will be the one that produces the sample spread equal to the spread of the sequential posterior density. The bandwidth parameter used in the Lorenz-63 experiment with 20 particles is . For the experiment with 100 particles we use . Figure S6 shows experiments for the 40 variables Lorenz-96 system in which we vary the bandwidth from to . The RMSE shows only slight changes with similar values in the three experiments (Figure S6a). The RMSE measure for the MPF appears rather robust to the kernel bandwidth. Figure S6b shows the spread of the experiments for different values. A choice of appears to give an optimal spread of the particles for this experiment. Figure S6: (a) RMSE for the MPF in the Lorenz-96 experiments using a bandwidth of α=2,20,100. (b) The corresponding spread of the particles of the posterior density.

### si.h Experiments using partial observations

An experiment to evaluate the performance of MPF for a partially observed Lorenz-96 system (40 variables and 20 observations) was conducted using 20 particles. Figure S7a shows the RMSE given by the MPF and the stochastic EnKF without localization. The MPF outperforms the EnKF in terms of the RMSE measure. Figure S7b shows the spread of the particles for both filters. Figure S7: RMSE (a) and spread (b) of the state as a function of the assimilation cycles produced by EnKF and MPF in the 40-variables Lorenz-96 system with 20 observations. Both filters use 20 particles.

Figure S8 shows the mean state in the three Lorenz-63 variables for the partial observation experiment. They are estimated by the particle filters using just 5 particles for the case in which only is observed and , . The MPF filter (upper panels) manifests a very good performance with the mean state estimate in the observed variable very close to the true value, while it is also capturing rather closely the behavior of the unobserved variables (Panels (b) and (c)). The information that is available in the observed variable is enough to recover through the prediction density correlations of the full hidden state. The SIR filter for the same conditions and same set of observations diverges (lower panels of Fig. S8), it finishes with all the weight in only one particle. Figure S8: Estimated mean state and the true evolution of the Lorenz-63 variables. The only observed variable is x, we use only 5 particles. Upper panels show the MPF experiment and lower panels SIR experiment. Time axis corresponds to number of assimilation cycles, the length of the assimilation cycle is Δt=0.01.

As a further experiment, a more challenging one, we evaluate the performance of the MPF for long assimilation windows and partial observations, the assimilation cycle is extended to . Note the integration step is kept as in all the experiments at so that there are 100 model time steps between observations. The number of particles is 20. In this case, we expect a much stronger effect of the nonlinear terms in the Lorenz-63 equations. Figure S9 shows the mean state estimates with the MPF (upper panels) and SIR (lower panels). While the SIR filter diverges, the MPF exhibits a very good performance considering the small number of particles and partial sporadic observations. Figure S9: Estimated mean state and the true evolution of the Lorenz-63 variables for a long assimilation window Δt=0.1. There is only one observed variable (x). A set of 20 particles is used in the filters.

A third partial observation experiment is evaluated in which only is observed. The experiment with only as observed variable does not contain all the dynamical information from observations to recover the full state, since the variable does not contain information about the wing of the attractor that the other two variables are in. We conducted an experiment with , and 20 particles. Upper panels of Fig. S10 show the results for the MPF experiment. As expected, the filter is unable to recover the state in and variables for some assimilation cycles. However, a good performance is found with the MPF, it appears to recover quickly after failing to detect the attractor of the system. On the other hand, the SIR filter (lower panels of Fig. S10) looses track and it is unable to recover the true trajectory for a large number of cycles. Figure S10: Estimated mean state and the true evolution of the Lorenz-63 variables. The only observed variable is z. A set of 20 particles is used in the MPF. Upper panels show the estimated mean state with the MPF and lower panels with the SIR filter.