1 Introduction
In this paper, we consider a type of optimization problem that arises in supervised machine learning (ML), which can be formulated as follows:

(1)  $\min_{\theta \in \mathbb{R}^d} f(\theta) = \sum_{i=1}^{n} f_i(\theta), \quad f_i(\theta) = \ell(\theta; x_i, y_i),$

where $\{(x_i, y_i)\}_{i=1}^{n}$ denotes the set of training data points, $x_i \in \mathbb{R}^d$ the feature vector, $y_i$ the label, and $\ell$ the loss function. In the context of ML, this problem is known as empirical risk minimization (ERM). We consider cases in which the number of training data points $n$ is so large that it is infeasible to use classical first- or second-order optimization methods. These methods require the gradients of the component functions $f_i$, $i = 1, \ldots, n$, to be evaluated and stored at each iteration. When $n$ is large, the cost of evaluating and storing these gradients is prohibitive. The massive scale of modern data sets poses a significant research challenge for developing efficient algorithms to solve such large-scale ERM problems reddi2017new ; bottou2018optimization . A commonly used approach is stochastic gradient descent (SGD), which is closely connected with stochastic approximation methods. SGD can be regarded as a stochastic version of gradient descent bottou2010large ; bottou1998online . The key idea is to substitute the expensive gradient calculation with a cheap estimate of the gradient. The question is then how to obtain a cheap yet effective estimate of the true gradient. The common practice in SGD is to randomly choose a subset of the training data points and to calculate an estimate of the true gradient based on the chosen data points. The estimated gradient is then used to update the parameter at each iteration. The main advantages of SGD are its low per-iteration computation and memory requirements and its often fast initial convergence, but its success demands a carefully designed step-size schedule to ensure convergence. Despite a significant amount of work, see e.g. duchi2011adaptive ; schaul2013no ; mahsereci2015probabilistic , there is still no generally applicable approach to tune the step size in a theoretically sound way.
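As a concrete illustration, the minibatch gradient estimate at the heart of SGD can be sketched as follows (a minimal sketch; the function names and the $1/\sqrt{k}$ step-size schedule are our illustrative choices, not prescribed by the text):

```python
import numpy as np

def sgd(grad_fi, theta0, n, n_iters=2000, batch_size=32, step0=0.1, rng=None):
    """Minimal SGD sketch for f(theta) = sum_i f_i(theta).

    `grad_fi(theta, idx)` should return the averaged gradient of the
    components indexed by `idx`. All names and the 1/sqrt(k) step-size
    schedule are illustrative choices, not prescribed by the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iters + 1):
        idx = rng.choice(n, size=batch_size, replace=False)  # random subset
        g = grad_fi(theta, idx)                   # cheap estimate of the true gradient
        theta = theta - (step0 / np.sqrt(k)) * g  # decaying step size
    return theta
```

On a noiseless linear least-squares problem this recovers the true parameter; in general, the step-size schedule must be tuned, which is exactly the difficulty noted above.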
As an alternative to SGD, the class of incremental proximal methods (IPMs) has also attracted a lot of attention in the literature parikh2014proximal ; bertsekas2011incremental ; aky2018the . The key procedure of IPMs is an online version of the proximal operator. This operator minimizes a single component, or a minibatch of components, of the cost function by a local search over a constrained parameter space centered around the solution given at the previous iteration. This local parameter space is specified in the form of a regularizer. Compared with SGD, IPMs are preferable for linear cases, since an analytical form of the iterative solution is available parikh2014proximal ; bertsekas2011incremental . For nonlinear cases, IPMs are prone to become computationally inefficient, as no tractable analytical solution exists and an iterative numerical solver is required to perform a local search at every proximal step parikh2014proximal ; bertsekas2011incremental . In addition, IPMs suffer from numerical instability when the parameter estimate gets close to the actual minimum aky2018the . This is caused by the lack of a mechanism to reduce the step size in the local parameter space exploration.
Recently, Kalman filters (KFs) have been explored to improve typical IPMs based on a probabilistic interpretation of the proximity operator. It has been revealed that the proximal operator is in theory closely related to the Bayesian update aky2018the . Motivated by this observation, a KF-based IPM (KF-IPM) was derived for online least-square fitting of linear Gaussian models. If the model to be fitted is nonlinear, the traditional KF can be replaced with the extended Kalman filter (EKF); the resulting algorithm is termed EKF-IPM here. The usage of KF iterations in IPMs provides a natural dampening mechanism for parameter updates, making the resulting KF-IPM and EKF-IPM much more numerically stable than traditional IPMs aky2018the . However, KF-IPM and EKF-IPM have severe issues inherited from the intrinsic limitations of the KF and EKF. The first issue is that KF-IPM and EKF-IPM require an ad hoc quadratic form of $f_i$, namely $f_i(\theta) = \frac{1}{2}(y_i - h(x_i, \theta))^2$; only when this requirement is satisfied can the closed-form KF formulas be derived. Another issue is that EKF-IPM is likely to diverge when $h$ is highly nonlinear, due to a significant model mismatch caused by the employed first-order linearization of $h$. As we know, many ML tasks use different forms of nonlinear loss functions, e.g., losses built on the logistic function $1/(1+\exp(-\theta^\top x))$. Here $\theta^\top$ denotes the transpose of $\theta$ (we assume that both $\theta$ and $x$ are column vectors hereinafter). The presence of the aforementioned issues prevents KF-IPM and EKF-IPM from being widely used for addressing real ML tasks.
It is a consensus that the class of particle filters (PFs) remarkably outperforms the KF and its variants for nonlinear and/or non-Gaussian statistical filtering. Hence, it is natural to ask whether stochastic optimization can benefit from PF theory to get around the limitations of KF-IPM and EKF-IPM. We provide an affirmative answer to this question by deriving two novel PF-based stochastic optimizers (PFSOs). We also discuss the connections our methods have to: (a) PF methods for stochastic dynamic filtering; (b) PF methods for static parameter estimation; and (c) existing PF methods for optimization.
The remainder of this paper is organized as follows. In Section 2, we briefly review the link between KF and IPMs. In Section 3, we present the proposed PFSOs in detail. In Section 4, we discuss connections our methods have to other related work. In Section 5, we present experimental evidence for the advantages of our methods. Finally, we conclude the paper in Section 6.
2 Revisiting the Link between KF and IPMs
Here we briefly review the link between KFs and IPMs. For more details, readers are referred to aky2018the and references therein. This review provides the necessary background information required for developing the proposed PFSO methods in Section 3.
IPMs solve problems of the form of (1) in an iterative way. Each iteration produces an updated estimate of $\theta$. Denote the estimate of $\theta$ generated at iteration $k$ by $\theta_k$. Given $\theta_{k-1}$, only a single component of $f$, namely $f_{i_k}$, is involved in generating $\theta_k$. Specifically, $\theta_k$ is obtained by solving the following problem

(2)  $\theta_k = \text{prox}_{\lambda f_{i_k}}^{B}(\theta_{k-1}) = \arg\min_{\theta} \left\{ f_{i_k}(\theta) + \frac{1}{2\lambda} \|\theta - \theta_{k-1}\|^2_{B^{-1}} \right\},$

where $\text{prox}$ denotes the proximal operator, $\lambda > 0$ a regularization parameter, $B$ is a symmetric positive definite matrix, $\|\theta\|^2_{B^{-1}} = \theta^\top B^{-1} \theta$, and $B^{-1}$ the inverse of $B$. The same as in aky2018the , we slightly abuse the notation here for simplicity: $f_{i_k}$ stands for the component selected at iteration $k$, where $i_k$ is a random sample drawn uniformly from $\{1, \ldots, n\}$. For cases in which $f_{i_k}(\theta) = \frac{1}{2}(y_{i_k} - x_{i_k}^\top \theta)^2$, the solution to (2) is shown to be aky2018the :

(3)  $\theta_k = \theta_{k-1} + \frac{\lambda B x_{i_k}}{1 + \lambda x_{i_k}^\top B x_{i_k}} \left( y_{i_k} - x_{i_k}^\top \theta_{k-1} \right).$
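As an illustration, the closed-form update (3) for a single quadratic component can be sketched as follows (a minimal sketch assuming $f_{i_k}(\theta) = \frac{1}{2}(y_{i_k} - x_{i_k}^\top \theta)^2$; the function and argument names are ours):

```python
import numpy as np

def ipm_step(theta_prev, x, y, lam, B):
    """One incremental proximal step for a quadratic component
    f_i(theta) = 0.5 * (y - x @ theta)**2, i.e. the closed-form
    minimizer of f_i(theta) + ||theta - theta_prev||^2_{B^{-1}} / (2*lam).
    """
    Bx = B @ x
    gain = lam * Bx / (1.0 + lam * (x @ Bx))
    return theta_prev + gain * (y - x @ theta_prev)
```

One can verify the result by checking that the first-order optimality condition of the proximal objective holds at the returned point.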
Now we define a model that consists of a prior density function and a likelihood function as follows,

(4)  $p(\theta) = \mathcal{N}(\theta; \theta_{k-1}, \lambda B_{k-1}), \quad p(y_{i_k} \mid \theta) = \mathcal{N}(y_{i_k}; x_{i_k}^\top \theta, 1),$

where $\mathcal{N}(\theta; m, \Sigma)$ denotes a Gaussian distribution with mean vector $m$ and covariance matrix $\Sigma$. Since both the prior density and the likelihood function are Gaussian, the posterior distribution $p(\theta \mid y_{i_k})$ is also Gaussian. Let $p(\theta \mid y_{i_k}) = \mathcal{N}(\theta; \theta_k, \lambda B_k)$. Then we have haykin2001kalman ; aky2018the :

(5)  $\theta_k = \theta_{k-1} + \frac{\lambda B_{k-1} x_{i_k}}{1 + \lambda x_{i_k}^\top B_{k-1} x_{i_k}} \left( y_{i_k} - x_{i_k}^\top \theta_{k-1} \right),$

(6)  $B_k = B_{k-1} - \frac{\lambda B_{k-1} x_{i_k} x_{i_k}^\top B_{k-1}}{1 + \lambda x_{i_k}^\top B_{k-1} x_{i_k}}.$
The above equations (5)-(6) constitute the recursion of KF-IPM at iteration $k$. The only difference between (5) and (3) is that $B$ is substituted with $B_{k-1}$. The corresponding proximity operator addressed by KF-IPM can be formulated as

(7)  $\theta_k = \arg\min_{\theta} \left\{ f_{i_k}(\theta) + \frac{1}{2\lambda} \|\theta - \theta_{k-1}\|^2_{B_{k-1}^{-1}} \right\}.$

By comparing (7) with (2), we see that KF-IPM stands for a special class of IPM that can adaptively tune $B$ over iterations. It has been demonstrated that, by adapting $B_k$ according to (6), KF-IPM is markedly more numerically stable than typical IPMs aky2018the .
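For concreteness, one consistent instantiation of the recursion (5)-(6) can be sketched as below; it is simply the Bayesian (Kalman) update of the Gaussian model (4), with variable names of our choosing:

```python
import numpy as np

def kf_ipm_step(theta_prev, B_prev, x, y, lam):
    """One KF-IPM iteration: the posterior of the Gaussian model (4) is
    N(theta_k, lam * B_k); (5) updates the mean and (6) the scale matrix.
    """
    Bx = B_prev @ x
    denom = 1.0 + lam * (x @ Bx)
    theta = theta_prev + (lam / denom) * Bx * (y - x @ theta_prev)
    B = B_prev - (lam / denom) * np.outer(Bx, Bx)
    return theta, B
```

The update can be checked against the direct Gaussian-posterior formulas (precision addition for the covariance, precision-weighted averaging for the mean).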
The above KF-IPM recursion is only applicable to cases in which the model is linear, i.e., $f_{i_k}(\theta) = \frac{1}{2}(y_{i_k} - x_{i_k}^\top \theta)^2$. The EKF-IPM is derived to handle cases in which $f_{i_k}$ is of the form $f_{i_k}(\theta) = \frac{1}{2}(y_{i_k} - h(x_{i_k}, \theta))^2$, where $h$ is nonlinear and differentiable. The recursion of EKF-IPM is aky2018the

(8)  $\theta_k = \theta_{k-1} + \frac{\lambda B_{k-1} h_k'}{1 + \lambda h_k'^\top B_{k-1} h_k'} \left( y_{i_k} - h(x_{i_k}, \theta_{k-1}) \right),$

(9)  $B_k = B_{k-1} - \frac{\lambda B_{k-1} h_k' h_k'^\top B_{k-1}}{1 + \lambda h_k'^\top B_{k-1} h_k'},$

where $h_k' = \nabla_{\theta} h(x_{i_k}, \theta)|_{\theta = \theta_{k-1}}$, and $\nabla_{\theta}$ denotes the gradient operator with respect to $\theta$.
Despite their advantages over traditional IPMs aky2018the , such KF-type IPMs have two unresolved issues that prevent them from being widely applied. The first is the restriction on the form of $f_i$: it should have the quadratic form $f_i(\theta) = \frac{1}{2}(y_i - h(x_i, \theta))^2$, which is required for deriving the KF or EKF recursions. In ML applications, different forms of nonlinear loss functions, e.g., losses built on the logistic function $1/(1+\exp(-\theta^\top x))$, are often used. For such applications, neither KF-IPM nor EKF-IPM can be used. The second issue is that, if $h$ is highly nonlinear, the estimate given by EKF-IPM is likely to diverge due to a significant model mismatch caused by the linearization of $h$. In the next section, we present PFSOs that resolve the above issues in an elegant way.
3 The Proposed Particle based Stochastic Optimization Methods
Motivated by both the advantages and the limitations of KF-type IPMs, as revisited in Section 2, we explore the PF as an alternative to the KF for addressing the stochastic optimization problem formulated in (1). We first present a generic particle-based scheme that extends KF-IPM in a straightforward way. Then we introduce the proposed PFSOs that implement this scheme.
3.1 A Particle Scheme for Stochastic Optimization
Consider a model defined by a prior density function and a likelihood function as follows,

(10)  $p(\theta) = \mathcal{N}(\theta; \theta_{k-1}, \lambda B_{k-1}), \quad p(y_{i_k} \mid \theta) \propto \exp\left( -f_{i_k}(\theta) \right).$

The prior density in the above model is the same as in (4). Recall that KF-IPM is derived based on the model in (4). At iteration $k$, KF-IPM outputs $\theta_k$ as an updated estimate of $\theta$, and, based on the model (4), $\theta_k$ happens to be the mean of the posterior $p(\theta \mid y_{i_k})$. Different from the model in (4), here we allow the likelihood to take any form, e.g., one built on the logistic function $1/(1+\exp(-\theta^\top x))$. For the logistic case, neither KF-IPM nor EKF-IPM is applicable for calculating the mean of the posterior, because their recursion equations can no longer be derived. We resort to particle-based methods to estimate the mean of the posterior $p(\theta \mid y_{i_k})$.
The basic idea is to run a sequential importance sampling (SIS) procedure to simulate $p(\theta \mid y_{i_k})$, and then seek an estimate of $\theta_k$ using the samples yielded from the simulation. The SIS procedure is the backbone of all PF methods. Following the standard terminology in the PF literature, we call sampled values of $\theta$ "particles" in what follows. Suppose that, standing at the beginning of iteration $k$, we have a weighted particle set $\{\theta_{k-1}^j, w_{k-1}^j\}_{j=1}^N$ that provides a Monte Carlo approximation to the posterior of the previous iteration, namely

(11)  $p(\theta \mid y_{i_{k-1}}) \approx \sum_{j=1}^N w_{k-1}^j \delta_{\theta_{k-1}^j}(\theta),$

where $\delta_{x}$ denotes the delta function located at $x$. Then a two-stage operation is performed in the SIS procedure. First, draw a set of new particles from a proposal function $q$: $\theta_k^j \sim q(\theta \mid \theta_{k-1}^j, y_{i_k})$, $j = 1, \ldots, N$. Then, calculate the importance weights of these particles as follows

(12)  $\tilde{w}_k^j = w_{k-1}^j \frac{p(y_{i_k} \mid \theta_k^j) \, p(\theta_k^j)}{q(\theta_k^j \mid \theta_{k-1}^j, y_{i_k})},$

(13)  $w_k^j = \frac{\tilde{w}_k^j}{\sum_{m=1}^N \tilde{w}_k^m}.$

Under mild conditions and with an appropriate design of the proposal function, this updated particle set can provide a satisfactory Monte Carlo approximation to the posterior arulampalam2002tutorial , namely,

(14)  $p(\theta \mid y_{i_k}) \approx \sum_{j=1}^N w_k^j \delta_{\theta_k^j}(\theta).$

Then $\theta_k$ can be calculated as below

(15)  $\theta_k = \sum_{j=1}^N w_k^j \theta_k^j.$
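The weighting and estimation stages (12)-(15) can be sketched as below for the common case in which the proposal approximately cancels the prior factor in (12), so that each weight is simply multiplied by the likelihood of the new particle (`log_like` is our illustrative interface):

```python
import numpy as np

def sis_update(new_particles, prev_weights, log_like):
    """Weight the freshly proposed particles and return the normalized
    weights (13) together with the posterior-mean estimate (15).
    Working in log-space guards the weights against numerical underflow.
    """
    logw = np.log(prev_weights) + np.array([log_like(p) for p in new_particles])
    logw -= logw.max()              # stabilize before exponentiating
    w = np.exp(logw)
    w /= w.sum()                    # (13): normalization
    theta_hat = w @ new_particles   # (15): weighted mean of the particles
    return w, theta_hat
```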
The SIS algorithm has a serious problem: the variance of the importance weights increases stochastically over iterations arulampalam2002tutorial . The variance increase causes the phenomenon of particle degeneracy, which means that, after a few iterations, one of the normalized importance weights approaches one while the others tend to zero. To mitigate particle degeneracy, a resampling procedure is usually used to eliminate samples with low importance weights and duplicate samples with high importance weights arulampalam2002tutorial . A number of resampling schemes, such as residual resampling and minimum variance sampling, have been proposed in the literature, but it has been reported that their impacts on the final performance do not differ significantly van2001unscented . We used residual resampling in all our experiments in Section 5. A pseudo-code illustrating the PF scheme for stochastic optimization (PFSO) is presented in Algorithm 1.
The above PFSO scheme provides the basis for developing PF-based stochastic optimization algorithms, but a critical issue, namely the choice of the proposal function $q$, has not been addressed so far. Ideally, by choosing an appropriate $q$, the variance of the importance weights can be driven toward zero. In fact, the choice of the proposal function plays an important role in reducing the variance of the importance weights van2001unscented ; liu2008particle ; arulampalam2002tutorial . An empirical guideline is to choose one that mimics the target distribution but has heavier tails. In the following subsections, we present two algorithm designs that implement the PFSO scheme.
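The residual resampling scheme used in our experiments can be sketched as follows (returning the indices of the surviving particles is a common implementation convention, not mandated by the text):

```python
import numpy as np

def residual_resample(weights, rng=None):
    """Residual resampling: deterministically copy floor(N * w_j) offspring
    of each particle, then draw the remaining offspring multinomially from
    the residual weights. Returns indices of the surviving particles.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(weights)
    scaled = N * np.asarray(weights, dtype=float)
    counts = np.floor(scaled).astype(int)        # deterministic part
    residual = scaled - counts
    n_rest = N - counts.sum()
    if n_rest > 0:
        residual /= residual.sum()
        counts += rng.multinomial(n_rest, residual)
    return np.repeat(np.arange(N), counts)
```

After resampling, all weights are reset to $1/N$.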
3.2 Kernel Smoothing based PFSO
Here we adopt a kernel smoothing technique, referred to as the Liu and West method liu2001combined in the literature, to generate new particles in the context of PFSO. The resulting algorithm is termed kernel smoothing based PFSO (KS-PFSO). Suppose that, standing at the beginning of iteration $k$, we have at hand a weighted particle set $\{\theta_{k-1}^j, w_{k-1}^j\}_{j=1}^N$ that satisfies (11). We calculate the mean and variance of the particle set by particle approximation. Denote the approximated mean and variance by $\bar{\theta}_{k-1}$ and $V_{k-1}$, respectively. Then sample $\epsilon^j \sim \mathcal{N}(\mathbf{0}, h^2 V_{k-1})$ and set

(16)  $\theta_k^j = a \theta_{k-1}^j + (1 - a) \bar{\theta}_{k-1} + \epsilon^j, \quad j = 1, \ldots, N,$

where $\mathbf{0}$ denotes a $d$-dimensional zero-valued vector. $a$ and $h$ are free parameters chosen to satisfy $a^2 + h^2 = 1$, which ensures that the mean and variance of these newborn particles are correct liu2001combined . The operation in (16) brings two desirable results. First, the effect of particle rejuvenation is achieved. Second, it retains the mean of the particles while avoiding over-dispersion of the new particles liu2001combined . We see from (16) that, when the value of $a$ approaches 1, the position of $\theta_k^j$ is close to $\theta_{k-1}^j$, $j = 1, \ldots, N$. Now let $a$ take a value close to 1; then the corresponding proposal from which the new particles are drawn is approximately equivalent to the posterior of the previous iteration. According to (12), the importance weights of the newborn particles can be obtained as follows

(17)  $\tilde{w}_k^j = w_{k-1}^j \, p(y_{i_k} \mid \theta_k^j),$

(18)  $w_k^j = \frac{\tilde{w}_k^j}{\sum_{m=1}^N \tilde{w}_k^m}.$

Note that, if a resampling operation is performed at the end of iteration $k-1$, then all $w_{k-1}^j$'s have the value $1/N$, so (17) can be substituted with the simpler operation $\tilde{w}_k^j = p(y_{i_k} \mid \theta_k^j)$. The KS-PFSO algorithm is summarized in Algorithm 2.
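The particle-generation step (16) can be sketched as follows (a sketch; the value a = 0.98 is an illustrative choice close to 1):

```python
import numpy as np

def ks_pfso_propose(particles, weights, a=0.98, rng=None):
    """Liu and West kernel-smoothing proposal, (16): shrink each particle
    toward the weighted mean and add Gaussian jitter with h^2 = 1 - a^2,
    which keeps the mean and covariance of the particle cloud unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = weights @ particles                        # weighted mean
    centered = particles - mean
    cov = (weights[:, None] * centered).T @ centered  # weighted covariance V
    h2 = 1.0 - a * a
    eps = rng.multivariate_normal(np.zeros(particles.shape[1]),
                                  h2 * cov, size=len(particles))
    return a * particles + (1.0 - a) * mean + eps
```

Because $a^2 + h^2 = 1$, the jittered cloud has (in expectation) the same mean and covariance as the input cloud, which is the over-dispersion-avoidance property noted above.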
3.3 MCMC Assisted PFSO
In KS-PFSO, as shown in Algorithm 2, a resampling procedure is used to reduce particle degeneracy. This procedure results in multiple copies of the fittest particles and the removal of low-weight particles, which may lead to a phenomenon called particle impoverishment arulampalam2002tutorial . In the extreme case, one particle contributes all $N$ copies while all the other particles are deleted. Here we propose a Markov chain Monte Carlo (MCMC) assisted PFSO (MCMC-PFSO), in which an MCMC move step is used to strengthen particle diversity after the resampling step without affecting the validity of the approximation. The pseudo-code of the MCMC-PFSO algorithm is presented in Algorithm 3. Empirical results in Section 5 show that MCMC-PFSO is preferable to KS-PFSO when the particle size $N$ takes a small value, corresponding to cases wherein the posterior distribution is undersampled.
4 Connections to Related Algorithms
Here we discuss connections our methods have to other related work.
4.1 Connections to PF methods for dynamic state filtering
Most PF algorithms are developed in the context of dynamic state filtering gordon1993novel ; arulampalam2002tutorial . In these methods, a state transition prior is usually precisely defined and then adopted as the proposal to generate new particles. For PFSOs, no state transition prior function is defined, and the new particles are sampled by using a kernel smoothing technique termed the Liu and West method liu2001combined . In PFSOs, the proposal distribution from which newborn particles are sampled is actually an approximation of the posterior of the previous iteration. In addition, MCMC-PFSO is related to the improved PF methods presented in liu2008single ; gilks2001following , which also adopt an MCMC move step to strengthen particle diversity.
The fundamental difference between PFSOs and PF methods for dynamic state filtering can be summarized as follows. In the context of the former, $\theta$ is treated as an unknown static parameter, while, in the context of the latter, the quantity to be estimated is a dynamic state variable that changes over time.
4.2 Connections to PF methods for static parameter estimation
The PFSOs proposed here have connections to existing PF methods derived for static model parameter estimation, the most representative of which are those presented in chopin2002sequential ; ridgeway2003sequential . Specifically, both MCMC-PFSO and the methods of chopin2002sequential ; ridgeway2003sequential use an MCMC move step to bypass particle impoverishment after the resampling step. The difference lies in that the former employs the Liu and West method to generate new particles besides the MCMC move step, while the latter rely completely on the MCMC move step to generate new particles. In addition, in the context of PFSOs, the index $i_k$ of the data point that arrives at iteration $k$ is randomly and uniformly drawn from the index set $\{1, \ldots, n\}$. For the methods in chopin2002sequential ; ridgeway2003sequential , the data point processed at iteration $k$ is deterministically the $k$-th item in the training data set.
4.3 Connections to filtering methods for optimization
As a type of recursive-filter-based stochastic optimization method, our algorithms have connections to all KF-type methods derived for stochastic optimization, e.g., those in patel2016kalman ; aky2018the ; bell1993iterated ; bertsekas1996incremental ; ho1962stochastic . In this paper, we substitute specially designed PFs for KFs in the context of stochastic optimization. With the aid of PFs, our methods allow any form of the loss function to be used, while the KF-type stochastic optimization methods require an ad hoc type of loss function. In addition, our methods can handle strong nonlinearity in the model elegantly, while all Kalman-type methods perform unsatisfactorily on highly nonlinear models.
Our methods also have connections to several PF-based optimization (PFO) methods in stinis2012stochastic ; liu2017posterior ; liu2016particle , as they all fall within an algorithmic framework termed the Sequential Monte Carlo (SMC) sampler del2006sequential . In this framework, the task of optimizing an objective function is translated into sampling from a sequence of target distributions and then evaluating the optimum based on the samples. A basic difference between existing PFO methods and the PFSOs proposed here is as follows. To evaluate a candidate solution $\theta$, the former require a full computation of $f(\theta)$, which is too expensive for the case considered here, as shown in (1); this is because the computation of $f(\theta)$ requires access to all training data points and computation of all component functions of $f$. In contrast, our methods deal with only a single component function per iteration, which requires access to only a single training data point.
5 Experimental Evaluations
We seek to experimentally validate two claims about our methods. The first claim is that the PFSOs proposed here outperform both KF-type and typical IPMs in handling nonlinear models in the context of stochastic optimization. We tested this claim on two synthetic nonlinear least-square fitting applications and six real-world binary classification applications. The second claim is that PFSOs have the flexibility to handle different types of loss functions with a performance guarantee. We tested this claim using the six real-world binary classification applications. All objective functions to be optimized in our experiments are nonlinear.
5.1 Synthetic data experiments
We considered two nonlinear model fitting experiments, in which EKF-IPM, UKF-IPM and the typical IPM are involved for performance comparison. The term UKF-IPM is the abbreviation of unscented Kalman filter (UKF) based IPM. It is obtained by substituting a UKF recursion for the EKF recursion, namely (8)-(9). Readers are referred to julier2004unscented for details about the UKF.
5.1.1 Leastsquare fitting of a sigmoid function
The setting of this experiment is borrowed from aky2018the . The cost function to be minimized is of the form (1) with

(19)  $f_i(\theta) = \frac{1}{2} \left( y_i - h(x_i, \theta) \right)^2,$

(20)  $h(x, \theta) = \frac{1}{1 + \exp(-\theta^\top x)}.$

We simulated training data points in which, given $x_i$, the label $y_i$ is set to be $h(x_i, \theta)$ plus a zero-mean Gaussian noise item. In this experiment, we set $\lambda$ and the noise standard deviation at 2 and 0.1, respectively. The initial estimate of $\theta$ for the IPM is randomly selected from a uniform distribution. As the model is nonlinear, the typical IPM used here is actually an approximate nonlinear IPM, which applies an iterative numerical solver at each proximal step. We set the values of all initial particles for the PFSOs identically, and the EKF-IPM and UKF-IPM are initialized in the same way. As all methods involved here adopt the same initial estimate of $\theta$, a fair comparison of these methods is guaranteed. Given an estimate of $\theta$ yielded at iteration $k$, the corresponding cost value is defined to be $f(\theta_k)$, and we use a normalized cost (NC) as the performance metric.
For each algorithm, we performed 30 independent runs and calculated the averaged NC over these runs. The experimental result is plotted in Fig. 1. As is shown, KS-PFSO and MCMC-PFSO have a faster initial convergence rate in the first 10 iterations. They perform comparably to EKF-IPM and UKF-IPM in terms of the final cost. All these KF-type and particle-based methods are numerically stable, while the typical IPM suffers from instability in the iterations of the final phase. Fig. 2 illustrates how the performance of the PFSOs changes with the particle size $N$. It shows that, when $N$ reaches 500, the performance of these particle-based methods tends to its upper limit. In addition, we see that MCMC-PFSO performs better than KS-PFSO when $N$ takes very small values, corresponding to cases wherein the posterior is undersampled.
Figure 1: Fitting a sigmoid function using the typical IPM, EKF-IPM, UKF-IPM and the proposed PFSOs. 30 independent runs of each algorithm are conducted, and the figure shows the result averaged over these runs. The upper panel shows the evolution of the normalized cost over the iterations: KS-PFSO and MCMC-PFSO perform comparably to EKF-IPM and UKF-IPM, and significantly better than the typical IPM, which suffers from instability in the iterations of the final phase. The lower panel depicts the sum of the absolute values of the entries of the per-iteration parameter update, which implies that KS-PFSO and MCMC-PFSO converge faster than the other methods.
5.1.2 Least-square fitting of a highly nonlinear model
We conducted another least-square fitting experiment to further test the methods involved here. The experimental setting is the same as in subsection 5.1.1 except for the formulation of $h$, which is now defined to be:
(21) 
For each algorithm, we conducted 30 independent runs. The averaged NC over these runs is calculated and plotted in Fig. 3. We see that, for this highly nonlinear case, KS-PFSO performs significantly better than the other competitors in terms of minimizing the cost. The typical IPM again suffers from instability in the iterations of the final phase.
5.2 Binary classification using real data sets
We then tested the proposed PFSOs on six UCI data sets blake1998uci : Haberman downs2001exact , HTRU2 lyon2016fifty , IRIS, Banknote Authentication, Pima Indians Diabetes, and Skin Segmentation. Table 1 gives a summary of these data sets. In our tests, the typical IPM, UKF-IPM, and a typical SGD algorithm termed Adaline bottou1998online are included for reference. We focused on ERM for binary classification. The objective function is (1), where $y_i$ is the label and $x_i$ the feature vector. Except for IRIS, each data set consists of two classes of data instances. IRIS has three classes; we convert it into a two-class data set by treating its first two classes as one class and regarding the last class as the other. We considered two loss functions, namely a least-quadratic (LQ) loss and a logistic loss, both built on the logistic function $1/(1+\exp(-\theta^\top x))$.
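Since the exact loss formulas are not reproduced above, the sketch below uses standard textbook forms as an assumption: an LQ loss that matches the sigmoid output against a {0, 1} target, and the logistic loss for labels in {-1, 1}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lq_loss(theta, x, y):
    """Least-quadratic (LQ) loss on the sigmoid output; assumes the
    positive class (y = 1) targets output 1 and the negative class
    targets 0. A standard form, assumed for illustration."""
    target = 1.0 if y == 1 else 0.0
    return (target - sigmoid(x @ theta)) ** 2

def logistic_loss(theta, x, y):
    """Logistic loss for labels y in {-1, 1}; standard form, assumed."""
    return np.log1p(np.exp(-y * (x @ theta)))
```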
Table 1: Summary of the data sets.

Name of data set | Haberman | HTRU2 | IRIS | Banknote | Pima | Skin
Number of instances | 306 | 17898 | 150 | 1372 | 768 | 245057
Number of attributes | 3 | 9 | 4 | 5 | 8 | 4
Table 2: Error rate of each algorithm on the six data sets.

Algorithm | Loss function | Haberman | HTRU2 | IRIS | Banknote | Pima | Skin
Adaline | LQ | 0.2647 | 0.0899 | 0.3333 | 0.3943 | 0.3490 | 0.2187
IPM | LQ | 0.2647 | 0.0514 | 0.3333 | 0.0255 | 0.3490 | 0.0600
EKF-IPM | LQ | 0.2647 | 0.0514 | 0.3267 | 0.0445 | 0.3490 | 0.0849
UKF-IPM | LQ | 0.2647 | 0.0514 | 0.3333 | 0.0452 | 0.3490 | 0.0845
KS-PFSO | LQ | 0.2647 | 0.0447 | 0.0933 | 0.0233 | 0.3060 | 0.0592
KS-PFSO | Logistic | 0.2549 | 0.0363 | 0.0533 | 0.0561 | 0.2708 | 0.1110
MCMC-PFSO | LQ | 0.2647 | 0.0363 | 0.1000 | 0.0241 | 0.3021 | 0.0726
MCMC-PFSO | Logistic | 0.2582 | 0.0397 | 0.0533 | 0.0437 | 0.2839 | 0.1098
Table 3: Scaled computing time of each algorithm.

Algorithm | Adaline | IPM | EKF-IPM | UKF-IPM | KS-PFSO | MCMC-PFSO
Scaled time | 1 | 2.6686 | 2.6544 | 21.7107 |  |
For algorithm performance evaluation, we use 10-fold cross-validation. Specifically, we split a data set randomly into 10 sub data sets, figure out the optimum through ERM using 9 of the sub data sets, and reserve the remaining sub data set as a test set. This process is repeated 10 times, with every sub data set having one chance to act as the test set. Given the optimum found, denoted by $\hat{\theta}$, we predict the label for a data instance $x$ by checking the value of the logistic function $1/(1+\exp(-\hat{\theta}^\top x))$. If this value is bigger than 0.5, we set the predicted label to 1; otherwise to $-1$. By comparing the predicted labels with the true labels for data items in the test set, we obtain the error rate, which we use as the performance metric for algorithm comparison.
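The thresholding rule just described amounts to the following sketch (the -1 encoding of the other class is our assumed convention):

```python
import numpy as np

def error_rate(theta_hat, X_test, y_test):
    """Error rate under the thresholding rule: predict 1 when the
    sigmoid output exceeds 0.5, and -1 (assumed label convention)
    otherwise, then count the fraction of mismatches."""
    prob = 1.0 / (1.0 + np.exp(-(X_test @ theta_hat)))
    y_pred = np.where(prob > 0.5, 1, -1)
    return np.mean(y_pred != y_test)
```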
For every data set, we use the same initialization for each algorithm. The regularization parameter $\lambda$ is set to 0.25. All the other parameters, including the particle size for the PFSOs, are initialized in the same way as in Subsection 5.1. The learning rate of Adaline is set to decay with the iteration index.
Table 2 presents the experimental results. It shows that, for every data set, the best classification result in terms of error rate is always given by KS-PFSO. For the data sets Haberman, HTRU2, IRIS and Pima, the logistic loss function leads to better performance than the LQ-type loss function in terms of error rate. For the other two data sets, namely Banknote and Skin, the LQ-type loss function is preferable to the logistic loss function for use with KS-PFSO. In addition, we can see that the performance of UKF-IPM is nearly indistinguishable from that of EKF-IPM. We also checked the influence of the particle size $N$ on the error rate for the PFSOs; see the result in Fig. 4. Once again we see that MCMC-PFSO outperforms KS-PFSO when $N$ takes very small values, i.e., when the posterior is undersampled.
The significantly better performance of the PFSOs revealed above is obtained at the cost of computing time; see Table 3. However, PFSOs can be markedly accelerated by parallelization. The particle generation and weighting steps are straightforward to parallelize, as they require only independent operations on each particle. The resampling procedure can also be parallelized, as reported in murray2016parallel . The Rao-Blackwellization technique provides another way to accelerate PF methods if analytic structure is present in the model liu1998sequential ; sarkka2007rao .
6 Concluding Remarks
Motivated by the success and limitations of the KF-type IPMs (see Section 2), we developed two algorithms, namely KS-PFSO and MCMC-PFSO, for large-scale stochastic optimization. PF methods themselves are not new, having been developed for more than 20 years since the seminal paper gordon1993novel , but the PFSO methods presented here are new, being distinct from all existing stochastic optimization methods. The advantages of our methods over existing ones are empirically demonstrated using both synthetic and real-life data sets. To summarize, our methods have four desirable properties: (a) they free users from computing gradients and remove the requirement that the objective function be differentiable; (b) they place no requirement on the form of the component functions of $f$; (c) they markedly outperform existing methods in terms of convergence rate and accuracy; (d) their performance can be easily tuned by adapting the particle size. Currently, this paper lacks a rigorous theoretical analysis of PFSOs. A practical way to develop one is to draw on theories of PF or sequential Monte Carlo methods in, e.g., crisan2002survey ; sarkka2013bayesian ; douc2014long ; hu2008basic ; hu2011general .
We argue that our work opens the door to borrowing a rich body of PF methods to solve large-scale stochastic optimization problems. For example, we can borrow ideas from adaptive importance sampling, e.g., cappe2008adaptive ; oh1992adaptive ; liu2014adaptive , to design adaptive PFSOs that build up proposal functions automatically. Our recent work on robust PF liu2017ilapf ; liu2017robust ; dai2016robust ; liu2011instantaneous ; liu2019robust may provide tools for developing robust PFSOs.
References
 (1) S.J. Reddi, New Optimization Methods for Modern Machine Learning, Ph.D. thesis, Carnegie Mellon University, 2017.
 (2) L. Bottou, F.E. Curtis, and J. Nocedal, “Optimization methods for largescale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
 (3) L. Bottou, “Largescale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186. Springer, 2010.

 (4) L. Bottou, “Online learning and stochastic approximations,” Online Learning in Neural Networks, vol. 17, no. 9–42, pp. 142, 1998.
 (5) J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
 (6) T. Schaul, S. Zhang, and Y. LeCun, “No more pesky learning rates,” in International Conference on Machine Learning (ICML), 2013, pp. 343–351.
 (7) M. Mahsereci and P. Hennig, “Probabilistic line searches for stochastic optimization,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 181–189.
 (8) N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
 (9) D.P. Bertsekas, “Incremental proximal methods for large scale convex optimization,” Mathematical programming, vol. 129, no. 2, pp. 163, 2011.
 (10) Ö.D. Akyıldız, V. Elvira, and J. Míguez, “The incremental proximal method: A probabilistic perspective,” in Proc. of IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2018, pp. 4279–4283.
 (11) S. Haykin, Kalman Filtering and Neural Networks, Wiley Online Library, 2001.
 (12) M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/nonGaussian Bayesian tracking,” IEEE Trans. on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.
 (13) R. Van Der Merwe, A. Doucet, N. De Freitas, and E.A. Wan, “The unscented particle filter,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 584–590.
 (14) B. Liu, X. Ma, and C. Hou, “A particle filter using SVD based sampling Kalman filter to obtain the proposal distribution,” in Proc. of IEEE Conf. on Cybernetics and Intelligent Systems. IEEE, 2008, pp. 581–584.
 (15) J. Liu and M. West, “Combined parameter and state estimation in simulationbased filtering,” in Sequential Monte Carlo methods in practice, pp. 197–223. Springer, 2001.
 (16) N. Gordon, D. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/nonGaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.
 (17) B. Liu, C. Ji, X. Ma, and C. Hou, “Singletone frequency tracking using a particle filter with improvement strategies,” in Int’l Conf. on Audio, Language and Image Processing (ICALIP). IEEE, 2008, pp. 1615–1619.
 (18) W. R. Gilks and C. Berzuini, “Following a moving target–Monte Carlo inference for dynamic Bayesian models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 1, pp. 127–146, 2001.
 (19) N. Chopin, “A sequential particle filter method for static models,” Biometrika, vol. 89, no. 3, pp. 539–552, 2002.
 (20) G. Ridgeway and D. Madigan, “A sequential Monte Carlo method for Bayesian analysis of massive datasets,” Data Mining and Knowledge Discovery, vol. 7, no. 3, pp. 301–319, 2003.
 (21) V. Patel, “Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning,” SIAM Journal on Optimization, vol. 26, no. 4, pp. 2620–2648, 2016.
 (22) B.M. Bell and F.W. Cathey, “The iterated Kalman filter update as a Gauss-Newton method,” IEEE Transactions on Automatic Control, vol. 38, no. 2, pp. 294–297, 1993.
 (23) D.P. Bertsekas, “Incremental least squares methods and the extended Kalman filter,” SIAM Journal on Optimization, vol. 6, no. 3, pp. 807–822, 1996.
 (24) Y.C. Ho, “On the stochastic approximation method and optimal filtering theory,” Journal of Mathematical Analysis and Applications, vol. 6, pp. 152–154, 1962.
 (25) P. Stinis, “Stochastic global optimization as a filtering problem,” Journal of Computational Physics, vol. 231, no. 4, pp. 2002–2014, 2012.
 (26) B. Liu, “Posterior exploration based sequential Monte Carlo for global optimization,” Journal of Global Optimization, vol. 69, no. 4, pp. 847–868, 2017.
 (27) B. Liu, S. Cheng, and Y. Shi, “Particle filter optimization: A brief introduction,” in Int’l Conf. in Swarm Intelligence. Springer, 2016, pp. 95–104.
 (28) P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo samplers,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 3, pp. 411–436, 2006.
 (29) S.J. Julier and J.K. Uhlmann, “Unscented filtering and nonlinear estimation,” Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, 2004.
 (30) C.L. Blake, “UCI repository of machine learning databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
 (31) T. Downs, K.E. Gates, and A. Masters, “Exact simplification of support vector solutions,” Journal of Machine Learning Research, vol. 2, pp. 293–297, 2001.
 (32) R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, and J. D. Knowles, “Fifty years of pulsar candidate selection: from simple filters to a new principled realtime classification approach,” Monthly Notices of the Royal Astronomical Society, vol. 459, no. 1, pp. 1104–1123, 2016.
 (33) L.M. Murray, A. Lee, and P.E. Jacob, “Parallel resampling in the particle filter,” Journal of Computational and Graphical Statistics, vol. 25, no. 3, pp. 789–805, 2016.
 (34) J.S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American Statistical Association, vol. 93, no. 443, pp. 1032–1044, 1998.
 (35) S. Särkkä, A. Vehtari, and J. Lampinen, “Rao-Blackwellized particle filter for multiple target tracking,” Information Fusion, vol. 8, no. 1, pp. 2–15, 2007.
 (36) D. Crisan and A. Doucet, “A survey of convergence results on particle filtering methods for practitioners,” IEEE Transactions on Signal Processing, vol. 50, no. 3, pp. 736–746, 2002.
 (37) S. Särkkä, Bayesian filtering and smoothing, vol. 3, Cambridge University Press, 2013.
 (38) R. Douc, E. Moulines, and J. Olsson, “Long-term stability of sequential Monte Carlo methods under verifiable conditions,” The Annals of Applied Probability, vol. 24, no. 5, pp. 1767–1802, 2014.
 (39) X. Hu, T. Schön, and L. Ljung, “A basic convergence result for particle filtering,” IEEE Transactions on Signal Processing, vol. 56, no. 4, pp. 1337–1348, 2008.
 (40) X. Hu, T. Schön, and L. Ljung, “A general convergence result for particle filtering,” IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3424–3429, 2011.
 (41) O. Cappé, R. Douc, A. Guillin, J. Marin, and C.P. Robert, “Adaptive importance sampling in general mixture classes,” Statistics and Computing, vol. 18, no. 4, pp. 447–459, 2008.
 (42) M. Oh and J.O. Berger, “Adaptive importance sampling in Monte Carlo integration,” Journal of Statistical Computation and Simulation, vol. 41, no. 3–4, pp. 143–168, 1992.
 (43) B. Liu, “Adaptive annealed importance sampling for multimodal posterior exploration and model selection with application to extrasolar planet detection,” The Astrophysical Journal Supplement Series, vol. 213, no. 14, pp. 1–16, 2014.
 (44) B. Liu, “ILAPF: Incremental learning assisted particle filtering,” in Proc. of IEEE Int’l Conf. on Acoustics, Speech and Signal Processing (ICASSP 2018), 2018, pp. 4284–4288.
 (45) B. Liu, “Robust particle filter by dynamic averaging of multiple noise models,” in Proc. of the 42nd IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2017, pp. 4034–4038.
 (46) Y. Dai and B. Liu, “Robust video object tracking via Bayesian model averaging-based feature fusion,” Optical Engineering, vol. 55, no. 8, pp. 1–11, 2016.
 (47) B. Liu, “Instantaneous frequency tracking under model uncertainty via dynamic model averaging and particle filtering,” IEEE Trans. on Wireless Communications, vol. 10, no. 6, pp. 1810–1819, 2011.
 (48) B. Liu, “Robust particle filtering via Bayesian nonparametric outlier modeling,” in International Conference on Information Fusion (FUSION), 2019.