1 Methodology
1.1 Sequential Bayesian estimation
We assume the estimation problem is encompassed of a dynamical model , which predicts the state from a previous state, and an observational model , which transforms the state from the (hidden) state space to the observational space. The set of equations that defines the estimation problem is known as a statespace model or a hidden Markov model. These are
(1) 
(2) 
where is the state at time , are the observations, is the random model error, and is the observational error. The method developed here is general and does not rely on the additive or Gaussian assumption in model errors.
Observations are assumed to be measured at discrete times. Some apriori knowledge of the state at is assumed, the initial prior state density . The sequential Bayesian state inference is given in two stages:

Firstly, in the evolution stage the prediction density is determined as
(3) where we denote as . At , so that the prediction density is .

Secondly, in the assimilation stage Bayes rule is used to express the inference as a sequential process
(4) where is the observation likelihood and is the marginalized likelihood.
1.2 Particle flows and optimal transport
The concept of homotopy can be used to transform the prior density towards the posterior density. Continuous deformations through a parameter can be used to achieve this, such as , . (To simplify the notation we have left the conditioning to observations and time index implicit in this and the following subsection.) The parameter is interpreted as a pseudotime which is varied from 0 to 1 at a fixed real time. This concept was used in Daum and Huang (2007)
for particle filters. The densities are represented through particles, so that the deformations are represented by particles moving in a flow according to a set of ordinary differential equations,
(5) 
where is the drift or velocity field. We focus on deterministic flows so that diffusivity processes are not considered. In some recent works, the particles are pushed smoothly forward in pseudotimes using a Gaussian approximation to represent the velocity field (e.g. Bunch and Godsill (2016); Li and Coates (2017)).
Another promising approach for the sampling of complex posterior distributions is the optimal transport problem (Marzouk et al, 2017)
. Given the prior probability mass distribution
and the target probability mass distribution , we want to transport the probability mass from to using a mapping . The optimal transport problem seeks the transformation that gives the minimal cost to transport the distribution of mass from to . This represents the classical Monge optimal transport problem. There is a rigorous proof of the existence of such optimal mapping under mild conditions (McCann, 1995). Furthermore, (Angenent et al, 2003) have formulated the MongeKantorovich problem as a steepest descent gradient optimization.Our approach combines the ideas of optimal transport and particle flows, the socalled local approach to optimal transport (Villani, 2008). Through a sequence of mappings we seek to push forward the particles from the prior to the target density. As in particle flows, the sequence of mappings are required to be as smooth as possible. The particles behave as active Lagrangian tracers in a flow. The velocity field at each pseudotime step of the mapping sequence is chosen following a local optimal transport criterion.
1.3 Variational mappings
At each iteration of the sequence, we propose a local transformation that follows (5),
(6) 
where is assumed to be small and is an arbitrary vectorial function “sufficiently” smooth, which represents the velocity of the flow defined in (5).
The KullbackLeibler divergence is used as a measure of the difference between the intermediate density and the posterior density . The KullbackLeibler divergence between and after the transformation, , in terms of the inverse mapping is
(7) 
where and is the determinant Jacobian of the mapping and for brevity .
Our goal is to find the velocity field of the transformation that gives the minimum of for each mapping of the sequence. In other words, we need to find the direction , that gives the steepest descent of .
We use the Gateaux derivative of a functional, which is a generalization of the directional derivative. Given the functional , it is defined as
(8) 
The Gateaux derivative of the KullbackLeibler divergence, (7), in the direction is given by
(9) 
The derivative of the transformed logposterior density is
(10) 
Considering that in (10) and replacing in (9), the directional derivative of results in
(11) 
Equation 11 gives the derivative along . However, we require the negative gradient of in terms of the samples of for the optimization of as a function of . In general belongs to an infinite dimensional Hilbert space. Hence, the full optimization problem is still intractable in practice since we do not have a way to determine the that gives the steepest descent direction.
One way to limit our functional optimization problem so that it becomes tractable is choosing as space of functions the unit ball of a reproducing kernel Hilbert space (RKHS), which we denote as . This was proposed by Liu and Wang (2016). In this way, we constrain to and require that to find the gradient of . The optimization problem is then to find the that gives the direction of steepest descent of . The main properties of the RKHS are included in the SI Appendix.
Given a vectorvalued kernel function,
, any function from can be expressed as the dot product by the kernel that defines the RKHS,(12) 
Equation 12 is known as the reproducing property (see SI Appendix SI.B). The scalar kernel defines .
Using (12) in (11) and dot product properties, we find:
(13) 
This is valid for any such that . Therefore, the first term of the dot product in (13) is by definition the gradient of at ,
(14) 
This expression for the gradient, (14), is particularly suitable for MonteCarlo integration when is only known through a set of samples. Since we seek to minimize the KullbackLeibler divergence we choose as direction of the transformation the negative of its gradient, the steepest descent direction,
(15) 
Using Kernel reproducing property and integration by parts in (14), the resulting gradient flow of the mapping is
(16) 
In Moselhy and Marzouk (2012), the KullbackLeibler divergence is also used but with an extra regularization term to find a single global optimal map. Here, we apply a gradient flow via a sequence of local mappings. Under smoothness constrains(Villani, 2008), the minimum of the cost function as a function of the mappings, (6), is uniquely determined in (15), i.e. no regularization term is required. As shown in Tabak and VandenEijnden (2010), the gradient flow, (16), has as stationary solution so that the gradient flow converges toward the target density.
1.4 Mapping particle filter
Consider we have a set of equalweight particles which sample the posterior density at time . The target density at time which we aim to sample using the variational mapping is the posterior density . The mapping approach only requires a set of samples of the prior density. It is started by the set of unweighted particles that are evolved from the previous estimate to the present assimilation time by the dynamical model, i.e. , where the second subscript represents the mapping iteration. These samples are then pushed towards the sequential posterior density by the mapping iterations.
Given the set of particles that are samples of the intermediate density at mapping iteration , the gradient of the KullbackLeibler divergence from (14) at a state space position by MonteCarlo integration is
(17) 
The negative of this gradient represents the velocity field of the gradient flow. An interpretation of the two RHS terms in (17) is given in SI Appendix (SI.C). At the mapping iteration , the particle is transformed, according to the mapping (15), by
(18) 
where is the steepest descent direction at the particle position. (18) may be interpreted as the movement of the particles along the streamlines of the flow assuming small .
An ingredient needed to evaluate (17) is an expression for the gradient of the posterior density. A problem in sequential Bayesian inference is that there is no exact expression for the posterior density. We do know the likelihood function, but we only have a set of particles that represent the prior density, not the density itself. The prior density using the particle representation of the posterior density at time in (3) results in
(19) 
and the expression for the posterior density from (4) becomes
(20) 
Because of the finite ensemble size at time , this expression is an approximation of the posterior density. Equation 20 is the target density in the mapping.
At each mapping iteration, all the particles are moved following (18). Successive applications of the transformation, through gradient descent, with the corresponding updates of , will converge towards the minimum of the KullbackLeibler divergence. The pseudocode of the implemented algorithm which combines variational mapping with the sequential particle filter is shown in Algorithm 1.
The form in which the variational mapping is obtained, as steepest gradient descent, is suitable for the stochastic optimization algorithms used in the machine learning community. In this case,
, known as the learning rate, can be determined adaptively using from previous iterations (e.g. Zeiler (2012)).An important part of any iterative variational method is a stopping criterion. One possibility is to check the value of
for each particle, or averaged over all particles. Another option is to use importance sampling and to interpret the final density of the mapping sequence as a proposal density and calculate weights with respect to the posterior density, which will then automatically lead to an unbiased estimator. To use this, an expression for the final proposal density
is required, however, we have only its particle representation. Two possibilities exist to implement this in a practical way. If the dimension of the system is small one can use kerneldensity estimation based on the particles. The other approach is to use the transformations, that we do have explicitly, and relate the final density to the prior density. The approaches are detailed in the SI Appendix (SI.D).
To avoid the resampling in the importance sampling step, one can calculate the effective ensemble size as , and decide to continue the iterations in Algorithm 1 until reaches a certain threshold . In the following assimilation cycle, the weights of the particles representing the final proposal density has to be considered in the sequential posterior density, resulting in a weighted posterior density instead of (20). One should keep in mind, however, that our estimate of the posterior density is rather poor in highdimensional systems with a small number of particles, so the weights would not be very accurate.
2 Numerical experiments
The implementation of the mapping particle filter is evaluated using three chaotic nonlinear dynamical models with state spaces of up to 40 dimensions. Higher dimensional state spaces will be pursued in a followup work. We conducted a set of stochastic twin experiments. In these experiments, the dynamical model to produce the synthetic observations is the same as the one used in the statespace model, including the statistical parameters.
A Gaussian kernel is chosen for the experiments,
(21) 
As a first reasonable choice, is taken proportional to the model error covariance matrix, , i.e. . This appears to be a convenient choice since the model uncertainty, , already includes the physics so that it is expected to represent the scaling between the variables and also the correlations between variables. Under this choice the only parameter of the kernel that requires to be defined is . With growing dimension of the state vector and the number of particles small, the distance between the particles grows larger. To account for this, should be chosen larger to increase the distance between particles. In the experiments, we have found that should be chosen of the order of the dimension of the state vector. Further details on the experiments are given in SI Appendix.
A quick convergence of the prior to the posterior density in terms of the mapping iterations is the key ingredient for highdimensional applications. The applications we have in mind satisfy
. Therefore, we evaluated, apart from stochastic gradient descent with a fixed learning rate
, adadelta (Zeiler, 2012) and adam (Kingma and Ba, 2015) which have an adaptive learning rate for each state variable (see SI Appendix, SI.F). These stochastic optimization methods are particularly suitable for highdimensional spaces.The maximum number of mapping iterations is fixed to , and a threshold for the number of effective particles is used as criterion of convergence, . As mentioned, a threshold in the absolute value of the gradient could also be used as criterion of convergence. In SI Appendix, SI.F we compare these two criteria for convergence of the variational mapping.
3 Results
Figure 1 shows the marginalized prior density as a function of the three state variables of the Lorenz63 system determined at the 115th time assimilation cycle, (each panel exhibits a variable). For plotting, the marginalized density is determined with kernel density estimation. The cycle was chosen so that the differences between the mapping particle filter (MPF) and the sampling importance resampling (SIR) filter are emphasized (i.e. a cycle for which the apriori density is not a good representation of the posterior density). The experiment uses 100 particles and the full state is observed. Because particles are spread in and out the high observation likelihood region, the proposal density is broader, asymmetric compared with the marginalized sequential posterior density. After the variational mapping, the resulting density is a close representation of the sequential posterior density, the target density for the mapping, in the three variables (see the three panels in Fig. 1).
The distribution of the particles can be seen in Fig. 2. Upper panels of Fig. 2 show the particles belonging to the prior density for each variable, the initial density for the mapping. Contours show the posterior density, marginalised to the 2dimensional plane. Lower panels show the final locations of the samples after 50 mapping iterations. The mapping has pushed most particles towards the high probability region of the posterior density. Note that the algorithm is not collapsing the particles toward the mode of the posterior, but representing the density with the samples as a whole. The diversity of the particles produced by the mapping is notable.
The same experiment conducted with the standard SIR, or bootstrap filter using resampling when is shown in Fig. 3. The sampling deficiencies in the SIR filter are visible in Fig. 3 where the particles of the prior density (upper panels) and the resampled density (lower panels) are shown with dots. Because the resampling conserves and replicates statistically only the particles with high likelihood, there is an evident sample impoverishment even in this lowdimensional twin experiment.
Because of the good sampling, the MPF exhibits a very weak sensitivity to the number of particles in the rootmeansquare error (RMSE) measure. Figure 4 shows the RMSE of the analysis mean with respect to the true state, for MPF and SIR filters for 100, 20 and 5 particles (Fig. 4 a, b and c respectively). Even for a large number of particles (with respect to the state dimension), the MPF outperforms the mean estimation with respect to the SIR filter. SIR filter diverges for 5 particles. Figure 4 a, b and c shows that the timemean RMSE for the MPF practically does not change between 100 particles with a timemean RMSE of 0.482 to 5 particles whose timemean RMSE is 0.489. Thus, the performance of the MPF is quite robust, only small changes in the RMSE are found when decreasing the number of particles.
Note the complexity of Algorithm 1 is proportional to while it is proportional to for the SIR filter, so that the computational cost of the assimilation stage is higher for the MPF with the same number of particles. However, a smaller number of particles is required so that it represents a promising venue for expensive dynamical models.
To evaluate the performance of the MPF for a larger statespace model, an experiment using the 40variable Lorenz96 dynamical system was conducted. An ensemble of 20 particles was used in the experiment. Scalable sampling methods are expected to produce useful results for these experiments in which . Figure 5a compares the resulting RMSE of the state given by MPF and the stochastic EnKF without localization as a function of time. The SIR filter is not shown since it exhibits degeneracy in this experiment because of the high dimensional state and observation spaces. The MPF takes a longer time to converge (20 cycles), but the EnKF exhibits a larger RMSE (this result might be improved using more members or localization, although the Gaussian assumption will hit the EnKF results). Furthermore, the mapping particle filter estimates are rather stable in time, which is a result of its deterministic mapping in which the particles feel each other. The spread is rather stable and comparable to the RMSE in both filters (Fig. 5b).
A third set of experiments was conducted using a compartment stochastic dynamical model for cholera dynamics (Ionides et al, 2006). This is a 5variable system allowing for susceptible, infected, and three classes of recovery individuals in which cholera mortality is the only observed variable. From the inference point of view, it presents some challenges(Ionides et al, 2006). It has partial noisy observations whose variance depends on time. Transmission is assumed stochastic and has a multiplicative model error. Since our implementation of the MPF assumes additive model error, we have augmented the state space with a new variable with an evolution equation equal to the noise factor and in this way the augmented cholera model can be represented by additive model error terms. The ensemble is composed of 100 particles and a small artificial additive model error term was added to benefit particle diversity (Liu and West, 2001). Note that a nonGaussian density for additive model error would give more realistic features. Figure 6a shows the true mortality time series and the one estimated by the MPF. The experiment started with an initial mean state which underestimates the number of true susceptible individuals by 20% . A good performance is found for the MPF (total RMSE is 0.0031). The SIR filter follows the cholera outbreaks but the estimate is much more noisier when the same level of noise is used as for the MPF (total RMSE is 0.0039). By decreasing the level of noise in the SIR filter, the noise diminishes but at the expense of a strong underestimation of the peaks (total RMSE is 0.0077).
4 Discussion
The proposed variational mapping particle filter is based on a deterministic gradient flow. It is able to keep a large effective sample size, avoiding the resampling step, even for long recursions subject to the convergence of the mapping sequences. Furthermore, it shows a robust behavior in terms of the RMSE with a consistent ensemble spread in the conducted experiments, using only a small number of particles.
In the limit of a single particle, , the gradient of the KullbackLeibler divergence is equal to the negative of the gradient of the logposterior density (see (17)). In that case, the optimization via gradient descent determines the mode of the posterior density. Therefore, the method for is equivalent to threedimensional variational data assimilation (3DVar) with in the role of .
The mapping of samples in the variational mapping particle filter is produced by moving the particles where they maximize the information gain of the proposal density with respect to the sequential posterior density, via the KullbackLeibler divergence. In these terms, the proposed mapping particle filter seeks to maximize the amount of information available in a complex highdimensional posterior density using a stochastic optimization method and given a limited number of particles.
Acknowledgements. We are grateful to the reviewers for valuable comments. This work has been founded by CUNDA project of the European Research Council. A python library with the mapping particle filter is available at http://github.com/pulidom/mpf/.
References
 Angenent et al (2003) Angenent, S., Haker, S. and Tannenbaum, A. (2003) Minimizing flows for the Monge–Kantorovich problem. SIAM journal on mathematical analysis, 35, 6197.
 Arulampalam et al (2002) Arulampalam, M.S., Maskell, S., Gordon, N. and Clapp, T. (2002) A tutorial on particle filters for online nonlinear/nonGaussian Bayesian tracking. IEEE Transactions on signal processing, 50, 174188.
 Atkins et al (2013) Atkins, E., Morzfeld, M. and Chorin, A.J., (2013) Implicit particle methods and their connection with variational data assimilation. Mon. Wea. Rev., 141, 17861803.
 Bunch and Godsill (2016) Bunch, P. and Godsill, S., (2016) Approximations of the optimal importance density using Gaussian particle flow importance sampling. Journal of the American Statistical Association, 111, 748762.
 Cappe et al (2005) Cappé O, Moulines E, Rydon T. (2005) Inference in Hidden Markov models. Springer Science+Business Media: New York, NY.
 Chen and Reich (2015) Cheng, Y. and Reich, S., (2015) Assimilating data into scientific models: An optimal coupling perspective. In Nonlinear Data Assimilation, 75118. Springer.
 Chorin and Tu (2009) Chorin, A.J. and Tu, X., (2009) Implicit sampling for particle filters. Proc. Nat. Acad. Sci. USA, 106, 1724917254.

Daum and Huang (2007)
Daum, F. and Huang, J., (2007) Nonlinear filters with loghomotopy.
In Signal and Data Processing of Small Targets 2007. 6699, p. 669918.  Doucet et al (2000) Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and computing, 10, 197208.
 Ionides et al (2006) Ionides, E.L., Bretó, C. and King, A.A. (2006) Inference for nonlinear dynamical systems. Proc. Nat. Acad. Sci. USA, 103, 1843818443.
 Kingma and Ba (2015) Kingma, D. and Ba, J. (2015) Adam: A method for stochastic optimization. In Int. Conf. on Learning Repres. (ICLR) arXiv preprint arXiv:1412.6980.

LeCun et al (2015)
LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning.
Nature, 521, 436444.  Li and Coates (2017) Li, Y. and Coates, M., (2017) Particle filtering with invertible particle flow. IEEE Transactions on Signal Processing, 65, 41024116.
 Liu and Wang (2016) Liu, Q. and Wang, D. (2016) Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, 23782386.
 Liu and West (2001) Liu, J. and West, M., (2001) Combined parameter and state estimation in simulationbased filtering. In Sequential Monte Carlo methods in practice, 197223. Springer, New York.
 Marzouk et al (2017) Marzouk Y, T Moselhy, M Parno, A Spantini (2017) An introduction to sampling via measure transport. To appear in Handbook of Uncertainty Quantification; R. Ghanem, D. Higdon, and H. Owhadi, editors; Springer. arXiv:1602.05023
 McCann (1995) McCann, R. (1995) Existence and uniqueness of monotone measurepreserving maps. Duke Mathematical Journal 80, 309–323.
 Moselhy and Marzouk (2012) Moselhy TA, Marzouk YM (2012). Bayesian inference with optimal maps. J. Comp. Phys.. 231, 78157850.
 Pulido and Rosso (2017) Pulido M. and O. Rosso (2017) Model selection: Using information measures from ordinal symbolic analysis to select model subgrid scale parameterizations. J. Atmos. Sci., 74, 3253–3269 doi: 10.1175/JASD160340.1
 Pulido et al. (2018) Pulido M., P. Tandeo, M. Bocquet, A. Carrassi and M. Lucini, (2018) Parameter estimation in stochastic multiscale dynamical systems using maximum likelihood methods. Tellus, 70, 1442099.
 Reich (2013) Reich, S. (2013) A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, 35, A2013A2024.

Scholkopf and Smola (2002)
Scholkopf, B. and Smola, A.J. (2002)
Learning with kernels: support vector machines, regularization, optimization, and beyond.
MIT press.  Tabak and VandenEijnden (2010) Tabak, E. G. and VandenEijnden, E. (2010) Density estimation by dual ascent of the loglikelihood. Comm. Math. Sci., 8, 217233.
 vanLeeuwen (2010) vanLeeuwen P.J. (2010) Nonlinear data assimilation in geosciences: an extremely efficient particle filter. Quart. J. Roy. Met Soc.,136, 19911999.
 vanLeeuwen (2015) vanLeeuwen, P.J. (2015) Nonlinear Data Assimilation for highdimensional systems. In Nonlinear Data Assimilation, 173. Springer.
 Villani (2008) Villani, C. (2008) Optimal transport: old and new. Vol. 338. Springer Science & Business Media.
 Zeiler (2012) Zeiler, M. D. (2012) ADADELTA: An adaptive learning rate method. arXiv: 1212.5701 [cs.LG]
 Zhu et al (2016) Zhu, M., Van Leeuwen, P.J. and Amezcua, J. (2016) Implicit equal weights particle filter. Quart. J. Roy. Met. Soc., 142, 19041919.
Si Supporting Information
si.a Hidden Markov models with Gaussian errors
Next, we derive the expression of the posterior density in the sequential framework as a function of the particles at . For the numerical experiments, model and observational errors are assumed Gaussian. These are taken as an example, but the technique framework is general. In particular, it is suitable for nonadditive and nonGaussian errors. The prediction probability density can be obtained analytically under the additive Gaussian model error assumption, in (1). The evolution of the particles from (3) gives
(S1) 
where . We assume equalweighted particles here; it is straightforward to assume weighted particles at time if so desired.
The observational error is also assumed additive Gaussian, in (2), so that the observational likelihood at is
(S2) 
The Bayes rule, (4), is used to obtain the sequential posterior density combining the prediction density, (S1), and the observation likelihood, (S2). The sequential posterior density given the particles at the previous time step is proportional to
(S3) 
This is the target density that the particles as Monte Carlo samples should be representing in the filter.
The gradient of the logarithm of the sequential posterior density, (S3), at is given by
(S4) 
with ,
and . Reducing we obtain
(S5) 
This gradient of the logposterior, (S5), does not depend on the unknown normalization constant of the sequential posterior density. Hence, the numerical evaluation of the gradient of the KullbackLeibler divergence is feasible.
si.b Reproducing kernel Hilbert space
Kernel functions are used in machine learning for finding regressions in datasets using an implicit feature space (Scholkopf and Smola, 2002). Instead of determining the data representation in that space, only the inner products of the images of two data points in the feature space are required. Then, the inner products in the image space are replaced by kernels in the feature space, the socalled kernel trick.
Some of the properties of the kernel are:

Let be a positive definite kernel. Mercer’s theorem says that the spectral decomposition
(S6) is given by a uniformly convergent series of the eigenfunctions
. The eigenfunctions satisfyand the eigenvalues are ordered, the larger the first.

Let be a RKHS, and , two functions of such that , . The inner product of the functions in is given by
(S7) 
satisfies the reproducing property,
(S8) 
Every positive definite defines a unique RKHS.
If where and , then the inner product in is .
si.c Interpretation of the gradient of the KullbackLeibler divergence
The first term of the gradient of the KullbackLeibler divergence, (17), tends to drive the particles towards the peaks of the posterior density. It gives a weighted average of including the current particle and surrounding ones within the kernel scale of the current one. This is the typical behavior of a variational importance sampler so that it accumulates particles at the high probability regions of the posterior density. On the other hand, the second term of in (17) tends to separate the particles. If the current particle is in the influence region of another particle say , i.e. within the kernel scale, the term will tend to separate them acting as a repulsive force between particles.
Suppose particle is into consideration. Particles that are far away, , do not contribute to the two terms in (17). On the other hand, if there is an th particle that is within the kernel scale of , the th particle will be driven by a force from the second term in (17) given by,
(S10) 
where so that the force given by this term is repulsive. It tends to separate the particles that are within the kernel scale.
si.d Importance sampling and the variational mapping particle filter
The variational mapping sampling scheme is suitable to be coupled with importance sampling. The last intermediate density obtained with the variational mapping sampling scheme can be used as a proposal density of the posterior density. Suppose we denote as to the particles obtained in the last mapping iteration. The values of the proposal and posterior densities at the particle positions are used to estimate the weights
(S11) 
and then they are normalized with
Since the output of the filter is a set of weighted samples, the sequential posterior density (e.g. (S3)) has to include the weights of the previous cycle,
(S12) 
To compute the weights between the posterior density and the proposal density –final density of the MPF–, we also need to evaluate the proposal density at the particle locations. We require the proposal density to be known (apart from the set of samples that we have from that density). There are two ways to obtain it. One way is to use kernel density estimation with the set of samples, , using the same kernels as in the mapping. The second option is to determine at each mapping iteration how the density is transformed with the mapping (e.g. Moselhy and Marzouk (2012)).
Before the mapping, we choose as initial intermediate density, for instance, an equally weighted Gaussian mixture centered at and with covariances ,
(S13) 
In the following mapping iterations , the density changes according to the mapping ,
(S14) 
where is the determinant Jacobian of the transformation and is the Hessian of the KullbackLeibler divergence. From (17), the Hessian is given by
(S15)  
The calculation of the Hessian of the KullbackLeibler divergence, (S15) is computationally demanding. Although note that the second derivatives are well known for standard kernel functions and the symmetry of the kernel can be used to reduce the calculations. However, the Hessian calculation is required at each mapping iteration to evolve from the previous intermediate density .
The weights account for the bias introduced in the filter when the set of samples obtained by the sequence of mappings is not exactly distributed according to the target density. However, it should be kept in mind that our estimate of the posterior density (S12) is not exact, so it is unclear what the calculated weights actually mean. It would be interesting to determine the accuracy of this estimate, but that is beyond the scope of this work.
The weights are also useful as a diagnostic of the quality of the convergence of the optimization method. In that case, an evaluation of the weights for the current intermediate density is conducted in each mapping iteration. If the given weights give an effective sample size which is over the threshold no further mapping iterations are required. In the conducted experiments of this work, in general, the effective sample size was over 98% after 50 mapping iterations during the whole recursion. In other words, if the number of mapping iterations is large enough , the weights will be almost equal. In highdimensional systems, a convergence criterion based on the module of the gradient of KulbackLeibler divergence can be used. This avoids the calculations of the Hessian or kernel density estimations. The convergence of the technique is evaluated through experiments with both measures in Section SI.F.
si.e Experiment details
The Lorenz63 system is given by
(S16)  
The equations are solved using a fourthorder RungeKutta scheme. The parameters are chosen at their standard values , and . The integration time step is and the assimilation cycles are every (10 integration time steps). The observation operator
is taken as the identity matrix, for the fullobserved state experiments in Section
3. Further experiments are conducted in Section SI.H using partial observations. The model error covariance is chosen diagonal. Its diagonal elements are set to 30% of the climatological variance of the variable, e.g. , where is the timeseries variance of variable . The observation error covariance is . The Lorenz63 experiments use importance sampling to evaluate the bias introduced by the sampling. The weights are evaluated using kernel density estimation with radial basis functions in the state space in coherence to the ones used for the mapping (same Gram matrix).For the 40variable Lorenz96 system experiments, the set of equations is
(S17) 
where and cyclic boundary conditions are imposed, , and . The equations were integrated using a fourthorder Runge–Kutta scheme, with an integration step of 0.001. The forcing is which results in chaotic dynamics. The observational time resolution is . The observational error covariance is . The model error covariance is . This amplitude of the model error noise is representative for the twoscale Lorenz 96 system which was estimated to be 0.3 using information measures by Pulido and Rosso (2017)
and 0.305 using an ExpectationMaximization algorithm coupled to an ETKF
Pulido et al. (2018). The ensemble is composed of 20 particles. The initial ensemble particles are states taken randomly from a climatology. Because of the large dimension, we do not use the effective number of particles via kernel density estimation as in the Lorenz63 experiments to evaluate convergence of the optimization, but the module of the gradient of the KullbackLeibler divergence.The compartment stochastic dynamical model for cholera is the one throughly described in (Ionides et al, 2006)
. A short description is given here. The individuals are classified as susceptible (
), infected individuals (), while recovered individuals belong to three classes () to allow for different inmune periods. The equations for the number of individuals for each category are given by(S18)  
Transmission is given by a stochastic differential equation
(S19) 
where
is a Gaussian white noise. The transitions between categories are given by
where , represents birth, is cholera mortality and denotes death from other causes. Equations S18 are integrated using an EulerMaruyama scheme with time step of month. The observations are cholera mortality data which are given by
(S21) 
While the mapping particle filter can handle nonadditive model errors we decided to transform the system of equations above to allow for an additive model error implementation. We use an augmented state space defining a new state variable as , so that the augmented system has additive model error whose the gradient of the logposterior density is represented by (S3). To add diversity to the particles in the filter a small additive noise term is added in all the equations. The variance of this additive noise is set to 10% of the variance. This noise is not used to generate the true trajectory, it is only included in the particle filter.
si.f The stochastic optimization methods and their convergence
An optimization of the KullbackLeibler divergence is conducted in the mapping particle filter to push the prior particles to samples from the posterior density. The particles are driven in a purely deterministic transformation by the variational mapping from the prior density to the posterior. The intermediate densities are represented through the particles. Each particle is then moved along the direction of the steepest descent of the KullbackLeibler divergence, which considers all the particle positions.
Several stochastic gradientbased optimization methods have been recently developed in the machine learning community (Zeiler, 2012; Kingma and Ba, 2015) which are directly applicable to the mapping particle filter. They are focused on efficient convergence in highdimensional control spaces when the cost function is noisy. The source of this noise is subsampling. These optimization methods, and their succesful convergence rates, have been instrumental for the success of deep learning applications (e.g. LeCun et al 2015). The variational mapping in this work is based on stochastic gradientbased optimization methods.
The stochastic gradient descent method has a fixed learning rate for all the iterations and all the control variables. This is wellknown to cause some inconveniences, in particular along extended valleys where zigzag convergence arises. As the gradient decreases in some directions, the convergence along those variables is very slow. The adaptative schemes search for a dynamic learning rate for each control variable. Ideally, a Newton method would be the optimal learning rate if the optimization problem is quadratic, but it requires Hessian evaluations which are computationally prohibitive in highdimensional spaces. The learning rate for each control variable in the stochastic optimization method, adadelta (Zeiler, 2012), is based in a global learning rate over a running average of the past absolute directional derivatives along the direction of the control variable. Therefore, the learning rate increases for directions with small directional derivatives. The method improves substantially but still some stagnation after the first iterations has been reported. To overcome this weakness, the algorithm for stochastic optimization adam(Kingma and Ba, 2015)
estimates the first and second moments of the gradients with bias correction. These two stochastic optimization algorithms, adadelta and adam, together with a descent gradient method were evaluated in the experiments. The chosen initial learning rate of the optimization methods is a tradeoff between convergence speed and smoothness flow assumptions. The recommended values exhibited a good performance, so that no manual tuning of the learning rate was required
(Zeiler, 2012).The convergence of the variational mapping is shown in Fig. S1 as a function of the mapping iteration at the first assimilation cycle. At the first cycle, the three optimization methods, descent gradient, adadelta and adam, have the same initial density. Upper panels of Fig. S1 show the rootmeansquare of the KL gradient, and lower panels the number of effective particles. The adadelta converges fastest, close with adam. For a larger number of iterations, the gradient of KL divergence is still diminishing for adam but almost no change after 25 iterations for adadelta. However, the number of effective samples does not differ between them, 40 iterations are required to achieve the maximum–98% of the total number of particles. The descent gradient requires about 200 iterations to achieve this number of effective particles.
Figure S2 shows the convergence at the 10th assimilation cycle of the optimization methods for evaluating the recursive effects. Notably, adadelta has the fastest convergence with the minimum rootmeansquare gradient similar to the one obtained in the first iteration. On the other hand, adam converges slowly in the first mapping iterations and starts to increase the convergence rate after 20 iterations (an effect that is likely caused by the momentum equations in adam which require some iterations to “warm up”). Adadelta has reached the maximum number of effective particles in 50 iterations. Because of the fast convergence and because the effective particle numbers is the most relevant parameter for the particle filter we took adadelta with 50 iterations and an initial learning rate of 0.03 as the default setting for the experiments.
There is a clear correlation between the decrease of the rootmeansquare of the KL gradient and the increase in the number of effective particles in Fig. S2 for the three stochastic optimization methods. This implies as mentioned that both measures could be used to determine the convergence in the optimization and the required number of iterations for a given threshold. Along this line, we note that the KullbackLeibler divergence can be directly expressed in terms of the weights.
Using Monte Carlo integration of the KullbackLeibler divergence,
(S22) 
from (S11), it results
(S23) 
As expected if the weigths are equally distributed in the particles, the KullbackLeibler divergence is zero, then . In other words, as the number of effective particles tends to , the KullbackLeibler divergence tends to zero, its minimum.
si.g Experiments on sensibility of the MPF
The performance of the mapping particle filter can be evaluated through the weights of the particles. As a function of the mapping iterations of the filter, the sequence of mappings is expected to start with a set of particles with most of the particles with almost zero weights and a few particles with very large weights. Note from (S23) that particles with almost zero weights produce a large KullbackLeibler divergence. Then, as the particles are pushed toward the posterior density, their weights will tend to be equally distributed for most of the particles. If the variational mappings work effectively, only a few particles should remain with low weights. In other words, the variance of the weights should be small for all the cycles. Figure S3a shows the evolution of the weights of all the particles as a function of time for and without threshold for the Lorenz63 system. Most of the particle weights are concentrated around 0.05, between 0.04 and 0.07 ( particles). There are some cycles in which one particle obtains a low weight but in the following cycle the weight of that particle is again within the equallyweighted range. Note that we did not perform any resampling. The variance of the weights is around (Figure S3b). The number of effective particles is in general about 19 (from a total of 20 particles), but in a few cycles it can descend down to 16. A further experiment was conducted in which the maximum number of iterations was increased to 100. In that case, all the particles remain with weights larger than 0.04 during the whole time sequence and the peaks of variance of the weights found in Fig. S3b are not present (see Fig. S4). The number of effective particles is in this case over 18 for all the cycles
To evaluate the MPF without using importance sampling, we conducted a Lorenz63 experiment without using weights in the particles after the mapping. As criterion of convergence we use averaged over all the particles. The MPF experiment uses 20 particles and is compared with a SIR filter using 20 particles and another with 100,000 particles. The last is expected to be close to the true posterior density. Figure S5 shows the marginalized posterior densities of the three experiments in each Lorenz63 variable. Upper panels correspond to cycle 450th and lower panels to cycle 500th. The marginalized densities given by the MPF with 20 particles are remarkably close to the SIR experiment with 100,000 particles after a long sequence of cycles, without using resampling nor importance sampling. On the other hand, the SIR filter with 20 particles has large differences with the true posterior density not only in the mode but also the uncertainties are larger than the true marginalized posterior density in some directions and smaller in other ones. Because of the resampling step some tendency of the SIR with a small number of particles to generate a bimodal marginalized densities in some variables was also found.
The radial basis functions used as kernel require to set the bandwidth, which in this work was chosen as . The parameter depends on the number of particles and the dimensions of the problem and requires some tuning. The optimal parameter will be the one that produces the sample spread equal to the spread of the sequential posterior density. The bandwidth parameter used in the Lorenz63 experiment with 20 particles is . For the experiment with 100 particles we use . Figure S6 shows experiments for the 40 variables Lorenz96 system in which we vary the bandwidth from to . The RMSE shows only slight changes with similar values in the three experiments (Figure S6a). The RMSE measure for the MPF appears rather robust to the kernel bandwidth. Figure S6b shows the spread of the experiments for different values. A choice of appears to give an optimal spread of the particles for this experiment.
si.h Experiments using partial observations
An experiment to evaluate the performance of MPF for a partially observed Lorenz96 system (40 variables and 20 observations) was conducted using 20 particles. Figure S7a shows the RMSE given by the MPF and the stochastic EnKF without localization. The MPF outperforms the EnKF in terms of the RMSE measure. Figure S7b shows the spread of the particles for both filters.
Figure S8 shows the mean state in the three Lorenz63 variables for the partial observation experiment. They are estimated by the particle filters using just 5 particles for the case in which only is observed and , . The MPF filter (upper panels) manifests a very good performance with the mean state estimate in the observed variable very close to the true value, while it is also capturing rather closely the behavior of the unobserved variables (Panels (b) and (c)). The information that is available in the observed variable is enough to recover through the prediction density correlations of the full hidden state. The SIR filter for the same conditions and same set of observations diverges (lower panels of Fig. S8), it finishes with all the weight in only one particle.
As a further experiment, a more challenging one, we evaluate the performance of the MPF for long assimilation windows and partial observations, the assimilation cycle is extended to . Note the integration step is kept as in all the experiments at so that there are 100 model time steps between observations. The number of particles is 20. In this case, we expect a much stronger effect of the nonlinear terms in the Lorenz63 equations. Figure S9 shows the mean state estimates with the MPF (upper panels) and SIR (lower panels). While the SIR filter diverges, the MPF exhibits a very good performance considering the small number of particles and partial sporadic observations.
A third partial observation experiment is evaluated in which only is observed. The experiment with only as observed variable does not contain all the dynamical information from observations to recover the full state, since the variable does not contain information about the wing of the attractor that the other two variables are in. We conducted an experiment with , and 20 particles. Upper panels of Fig. S10 show the results for the MPF experiment. As expected, the filter is unable to recover the state in and variables for some assimilation cycles. However, a good performance is found with the MPF, it appears to recover quickly after failing to detect the attractor of the system. On the other hand, the SIR filter (lower panels of Fig. S10) looses track and it is unable to recover the true trajectory for a large number of cycles.
Comments
There are no comments yet.