Log In Sign Up

Particle Filtering Methods for Stochastic Optimization with Application to Large-Scale Empirical Risk Minimization

by   Bin Liu, et al.

There is a recent interest in developing statistical filtering methods for stochastic optimization (FSO) by leveraging a probabilistic perspective of the incremental proximity methods (IPMs). The existent FSO methods are derived based on the Kalman filter (KF) and extended KF (EKF). Different with classical stochastic optimization methods such as the stochastic gradient descent (SGD) and typical IPMs, such KF-type algorithms possess a desirable property, namely they do not require pre-scheduling of the learning rate for convergence. However, on the other side, they have inherent limitations inherited from the nature of KF mechanisms. It is a consensus that the class of particle filters (PFs) outperforms the KF and its variants remarkably for nonlinear and/or non-Gaussian statistical filtering tasks. Hence, it is natural to ask if the FSO methods can benefit from the PF theory to get around of the limitations of the KF-type IPMs. We provide an affirmative answer to the aforementioned question by developing three PF based SO (PFSO) algorithms. We also provide a discussion of relationships among (1) PF methods designed for stochastic dynamic filtering; (2) PF methods designed for static parameter estimation; and (3) our PFSO algorithms. For performance evaluation, we apply the proposed algorithms to solve a least-square fitting problem using simulated dataset, and the empirical risk minimization (ERM) problem in binary classification using real datasets. The experimental results demonstrate that our algorithms outperform remarkably existent methods in terms of numerical stability, convergence speed, classification error rate and flexibility in handling different types of models and loss functions.


Kalman Gradient Descent: Adaptive Variance Reduction in Stochastic Optimization

We introduce Kalman Gradient Descent, a stochastic optimization algorith...

Stochastic Gradient Variance Reduction by Solving a Filtering Problem

Deep neural networks (DNN) are typically optimized using stochastic grad...

A Unified Convergence Theorem for Stochastic Optimization Methods

In this work, we provide a fundamental unified convergence theorem used ...

Exponential Convergence Rates of Classification Errors on Learning with SGD and Random Features

Although kernel methods are widely used in many learning problems, they ...

KaFiStO: A Kalman Filtering Framework for Stochastic Optimization

Optimization is often cast as a deterministic problem, where the solutio...

SGD for robot motion? The effectiveness of stochastic optimization on a new benchmark for biped locomotion tasks

Trajectory optimization and posture generation are hard problems in robo...

Stochastic Variance Reduced Primal Dual Algorithms for Empirical Composition Optimization

We consider a generic empirical composition optimization problem, where ...

1 Introduction

In this paper, we consider a type of optimization problem that arises in supervised machine learning (ML), which can be formulated as follows,


where denotes the training data points,

the feature vector,

the label, and the loss function. In the context of ML, this problem is known as empirical risk minimization (ERM). We consider cases in which the number of training data points is so large that it is infeasible to use classical first or second order optimization methods. These methods require gradients of , to be evaluated and stored at each iteration. When is large, the cost for evaluating and storing these gradients is too expensive.

The massive scale of modern data sets poses a significant research challenge for developing efficient algorithms to solve such large-scale ERM problems reddi2017new ; bottou2018optimization . A commonly used approach is SGD, which is strongly connected with the stochastic approximation methods. SGD can be regarded as a stochastic version of the gradient descent methods bottou2010large ; bottou1998online . The key idea is to substitute the expensive gradient calculation with cheap estimations of the gradient. So the question is how to get a cheap while effective estimation of the true gradient? The common practice in SGD is to randomly choose a subset of the training data points, and then calculate an estimate of the true gradient based on the chosen data points. The estimated gradient is then leveraged for parameter updating at each iteration. The main advantage of SGD consists of the low per iteration computation and memory requirement and a frequently fast initial convergence rate, while the success of SGD demands an elaborately designed scheduling of the step-size to ensure convergence. Despite that a significant amount of work has been conducted, see e.g. duchi2011adaptive ; schaul2013no ; mahsereci2015probabilistic , there still lacks a generally applicable approach to tune the step-size in a theoretically sound way.

As an alternative to SGD, the class of IPMs has also attracted a lot of attentions in the literature parikh2014proximal ; bertsekas2011incremental ; aky2018the . The key procedure of IPMs is an online version of the proximal operator. This operator minimizes a single or a mini-batch of components of the cost function by a local search over a constrained parameter space, centered around the solution given at the previous iteration. This local parameter space is specified in the form of a regularizer. Compared with SGD, IPMs are preferable for linear cases, since an analytical form of the iterative solution is available parikh2014proximal ; bertsekas2011incremental . For nonlinear cases, IPMs are prone to become computationally inefficient, as no tractable analytical solution exists, and an iterative numerical solver is required to perform a local search at every proximal step parikh2014proximal ; bertsekas2011incremental . In addition, the IPM suffers from numerical instability when the parameter estimate gets close to the actual minimum aky2018the . This is caused by the lack of a mechanism to reduce the step-size in local parameter space exploration.

Recently KFs have been explored to improve typical IPMs based on a probabilistic explanation of the proximity operator. It is revealed that the proximal operator is in theory closely related with Bayes update aky2018the . Motivated by this observation, a KF based IPM (KF-IPM) is derived for online least-square fitting of linear Gaussian models. If the model to be fitted is nonlinear, the traditional KF can be replaced with EKF. The resulting algorithm is termed EKF-IPM here. The usage of KF iterations in IPMs provides a natural dampening mechanism for parameter updates, making the resulting KF-IPM and EKF-IPM much more numerically stable than traditional IPMs aky2018the . However, KF-IPM and EKF-IPM have severe problematic issues, inherited from the intrinsic limitations of KF and EKF. The first issue is that the working of KF-IPM and EKF-IPM requires an ad hoc form of , namely . Only when this requirement is satisfied, the closed-form KF formulas can be derived. Another problematic issue is that EKF-IPM is likely to be divergent when is highly nonlinear, due to a significant model mismatch caused by the employed first order linearization of . As we know, many ML tasks use different forms of nonlinear loss functions, e.g., the logistic function . Here denotes the transposition of (we assume that both and are column vectors herein and after). The presence of the aforementioned problematic issues prevents KF-IPM and EKF-IPM from being widely used for addressing real ML tasks.

It is a consensus that the class of particle filters (PFs) outperforms the KF and its variants remarkably for nonlinear and/or non-Gaussian statistical filtering. Hence, it is natural to ask if stochastic optimization can benefit from the PF theory to get around of the limitations of KF-IPM and EKF-IPM. We provide an affirmative answer to the aforementioned research question by deriving two novel PF based stochastic optimizers (PFSOs). We also provide a discussion on connections our methods have to: (a) PF methods for stochastic dynamic filtering; (b) PF methods for static parameter estimation; and (c) existent PF methods for optimization.

The remainder of this paper is organized as follows. In Section 2, we briefly review the link between KF and IPMs. In Section 3, we present the proposed PFSOs in detail. In Section 4, we discuss connections our methods have to other related work. In Section 5, we show experimental evidence on superiorities of our methods. Finally, we conclude the paper in Section 6.

2 Revisit the link between KF and IPMs

Here we briefly review the link between KFs and IPMs. For more details, readers are referred to aky2018the and references therein. This review provides the necessary background information required for developing the proposed PFSO methods in Section 3.

IPMs solve problems of the form of (1) in an iterative way. Each iteration produces an updated estimate of . Denote the estimate of generated at iteration as . Given , only a single component of , namely , is involved for generating . Specifically, is obtained through solving the following problem


where denotes the proximal operator, a regularization parameter, is a symmetric positive definite matrix, , the inverse of . The same as in aky2018the , we slightly abuse the notation here for simplicity, which actually stands for , where is a random sample drawn uniformly from . For cases in which , the solution to (2) is shown to be aky2018the :


Now we define a model that consists of a prior density function and a likelihood function as follows,



denotes a Gaussian distribution with mean vector

and covariance matrix . Let . Since both the prior density and the likelihood function are Gaussian, the posterior distribution is also Gaussian. Let . Then we have haykin2001kalman ; aky2018the :


The above equations (5)-(6) constitute the recursion of KF-IPM at iteration . The only difference between (5) and (3) is that is substituted with . The corresponding proximity operator addressed by KF-IPM can be formulated as


By comparing (7) with (3), we see that KF-IPM stands for a special class of IPM that can adaptively tune over iterations. It has been demonstrated that, by adapting according to (6), the KF-IPM is markedly more numerical stable than typical IPMs aky2018the .

The above KF-IPM recursion is only applicable for cases in which , and is of the form of , . The EKF-IPM is derived to handle cases in which , is of the form of , , is nonlinear and differentiable. The recursion of EKF-IPM is aky2018the


where , denotes the gradient operator with respect to .

Despite advantages over traditional IPMs aky2018the , such KF-type IPMs have two unresolved issues that prevent them from being widely applied. The first one is the restriction on the form of . It should have a form , which is required for deriving the KF or EKF recursions. In ML applications, different forms of nonlinear loss functions, e.g., the logistic function , are often used. For such applications, neither KF-IPM nor EKF-IPM can be used. The second issue is that, if is highly nonlinear, the estimate given by EKF-IPM is likely to be divergent due to a significant model mismatch caused by the linearization operation on . In the next section, we present PFSOs that resolve the above issues in an elegant way.

3 The Proposed Particle based Stochastic Optimization Methods

Motivated by both the advantages and the limitations of KF-IPMs, as revisited in Section 2, we explore PF as an alternative of KF for addressing the stochastic optimization problem formulated in (1). We first present a generic particle based scheme that extends KF-IPM in a straightforward way. Then we introduce the proposed PFSOs to implement the above scheme.

3.1 A Particle Scheme for Stochastic Optimization

Consider a model defined by a prior density function and a likelihood function as follows,


The prior density in the above model is the same as in (4). Recall that, the KF-IPM is derived based on the model in (4). At iteration , KF-IPM outputs as an updated estimate of , and, based on the model (4), happens to be the mean of the posterior . Different from the model in (4), here we allow to be of any form, e.g, the logistic form . For the logistic case, neither KF-IPM nor EKF-IPM is applicable for calculating the mean of the posterior, because their recursion equations cannot be derived again. We resort to particle based methods to estimate the mean of the posterior .

The basic idea is to run a sequential importance sampling (SIS) procedure to simulate , then seek an estimate of using the samples yielded from the simulation of . The SIS procedure is the backbone of all PF methods. Following the standard in PF related literature, we call sampled values of as “particles” in what follows. Suppose that, standing at the beginning of iteration , we have a weighted particle set , that provides a Monte Carlo approximation to , namely


where denotes the delta function located at . Then a two-stage operation is performed in the SIS procedure. First, draw a set of new particles from a proposal function : . Then, calculate importance weights of these particles as follows


Under mild conditions and with an appropriate design of the proposal function, this updated particle set can provide a satisfactory Monte Carlo approximation to arulampalam2002tutorial , namely,


Then can be calculated as below


The SIS algorithm has a seriously problematic issue, namely the variance of the importance weights increases stochastically over iterations

arulampalam2002tutorial . The variance increase will cause the phenomenon of particle degeneracy, which means that, after a few iterations, one of the normalized importance weights approaches one, while the others tend to zero. To reduce particle degeneracy, a resampling procedure is usually used to eliminate samples with low importance weights and duplicate samples with high importance weights arulampalam2002tutorial . A number of resampling schemes, such as residual resampling and minimum variance sampling, have been proposed in the literature, but it has been reported that their impacts on the final performance are not significantly different among each other van2001unscented . We used residual resampling in all our experiments in Section 5. A pseudo-code illustrating the PF scheme for stochastic optimization (PFSO) is presented in Algorithm 1.

1:  Initialization: Draw a random sample from . Set , . (Here and in what follows, ‘’ means ‘for all in ’ 
2:  for   do
3:     Sample , ;
4:     Calculate the importance weights of the samples: , ;
5:     Normalize the importance weights: , ;
6:     Set ;
7:     Resampling step: Eliminate/Duplicate samples with low/high importance weights, respectively, to obtain random samples approximately distributed according to ; Set .
8:  end for
Algorithm 1 PFSO: A generic PF scheme for stochastic optimization

The above PFSO scheme provides the basis for developing PF based stochastic optimization algorithms, while a critical issue, namely the choice of the proposal function , has not been addressed so far. For sampling from , we hope that, by choosing an appropriate , the variance of the importance weights can tend to zero. In fact, the choice of the proposal function plays an important role in reducing the variance of the importance weights van2001unscented ; liu2008particle ; arulampalam2002tutorial . An empirical guideline is to choose one that mimics the target distribution but has heavier tails. In the following subsections, we present two algorithm designs to implement the PFSO scheme.

3.2 Kernel Smoothing based PFSO

Here we adopt a kernel smoothing technique, referred to as Liu and West method liu2001combined in the literature, to generate new particles in the context of PFSO. The resulting algorithm is termed kernel smoothing based PFSO (KS-PFSO). Suppose that, standing at the beginning of iteration , we have at hand a weighted particle set , that satisfies . We calculate the mean and variance of with particle approximation. Denote the approximated mean and variance by and , respectively. Then sample and set


denotes a -dimensional zero valued vector. and are free parameters chosen to satisfy , which ensures the mean and variance of these new-born particles to be correct liu2001combined . The operation in (16) brings two desirable results. First, the effect of particle rejuvenation is achieved. Second, it retains the mean of the particles and meanwhile avoids over-dispersion of these new particles liu2001combined . We see from (16), when the value of approaches 1, the position of is close to , . Now we let take a value close to 1, then the corresponding proposal from which the new particles are drawn is approximately equivalent to . According to (12), the importance weights of the new-born particles can be obtained as follows


Note that, if a resampling operation is performed at the end of iteration , then all ’s has a value . So (17) can be substituted with a simpler operation . The KS-PFSO algorithm is summarized as follows in Algorithm 2.

1:  Initialization: Draw a random sample from . Set , and , . (Here and in what follows, ‘’ means ‘for all in ’.
2:  for  do
3:     Sample using Eqn.(16), .
4:     Calculate the importance weights of the samples: , .
5:     Normalize the weights: , .
6:     Set .
7:     Set , .
8:     Resampling step: the same as in Algorithm 1.
9:  end for
Algorithm 2 The KS-PFSO Algorithm

3.3 MCMC Assisted PFSO

In KS-PFSO, as shown in Algorithm 2, a resampling procedure is used to reduce particle degeneracy. This procedure results in multiple copies of the fittest particles and removal of low weight particles, which may lead to a phenomenon called particle impoverishment arulampalam2002tutorial . The extreme case is that one particle contributes

copies while all the other particles are deleted. Here we propose a Markov Chain Monte Carlo (MCMC) assisted PFSO (MCMC-PFSO), in which an MCMC moving step is used to strengthen particle diversity after the resampling step, while not affecting the validity of the approximation. The pseudo-code of the MCMC-PFSO algorithm is presented in Algorithm

3. Empirical results in Section 5 show that MCMC-PFSO is preferable to KS-PFSO when the particle size takes a small value, corresponding to cases wherein the posterior distribution is under-sampled.

1:  Initialization: the same as in Algorithm 2
2:  for   do
3:     Sample using Eqn.(16), ;
4:     Calculate the importance weights of the samples: , ;
5:     Normalize the importance weights: , ;
6:     Resampling step: the same as in Algorithm 1;
7:     MCMC move step: for , perform the following three operations: (a) Sample . Here

denotes a uniform distribution over [0,1]; (b) Sample a new particle

from a symmetric proposal, namely a Gaussian centered at ; (c) If , then set ;
8:     Set .
9:     Set , .
10:  end for
Algorithm 3 The MCMC-PFSO Algorithm

4 Connections to Related Algorithms

Here we discuss connections our methods have to other related work.

4.1 Connections to PF methods for dynamic state filtering

Most of PF algorithms are developed in the context of dynamic state filtering gordon1993novel ; arulampalam2002tutorial . In these methods, a state transition prior is usually precisely defined and then adopted as the proposal to generate new particles. For PFSOs, there is no state transition prior function defined, and the new particles are sampled by using a kernel smoothing technique termed Liu and West method liu2001combined . In PFSOs, the proposal distribution, from which new born particles are sampled, is actually an approximation of the posterior in the previous iteration. In addition, the MCMC-PFSO is related with the improved PF methods presented in liu2008single ; gilks2001following , which also adopt the MCMC moving step to strengthen particle diversity.

The fundamental difference between PFSOs and PF methods for dynamic state filtering can be summarized as follows. In the context of the former, is treated as an unknown static parameter; while, in the context of the latter, what to be estimated is a dynamic state variable that changes over time.

4.2 Connections to PF methods for static parameter estimation

The PFSOs proposed here have connections to existent PF methods derived for static model parameter estimation, the most representative of which are those presented in chopin2002sequential ; ridgeway2003sequential . Specifically, both MCMC-PFSO and methods of chopin2002sequential ; ridgeway2003sequential use an MCMC moving step to bypass particle impoverishment after the resampling step. The difference lies in that, the former employs Liu and West method to generate new particles besides the MCMC moving step, while the latter relies completely on the MCMC moving step to generate new particles. In addition, in the context of PFSOs, the index of the data point that arrives at iteration is , which is randomly and uniformly drawn from the index set . For methods in chopin2002sequential ; ridgeway2003sequential , the data point processed at iteration is deterministically the th item in the training data set.

4.3 Connections to filtering methods for optimization

As a type of recursive filter based stochastic optimization methods, our algorithms have connections to all KF-type methods derived for stochastic optimization, e.g., in patel2016kalman ; aky2018the ; bell1993iterated ; bertsekas1996incremental ; ho1962stochastic . In this paper, we substitute KFs with specially designed PFs in the context of stochastic optimization. With aid of PF, our methods allow any form of the loss function to be used, while the KF-type stochastic optimization methods require an ad hoc type of loss functions to be used. In addition, our methods can handle strong nonlinearity in the model elegantly, while all Kalman-type methods perform unsatisfactorily in addressing highly nonlinear models.

Our methods also have connections to several PF based optimization (PFO) methods in stinis2012stochastic ; liu2017posterior ; liu2016particle , as they all fall within an algorithmic framework termed Sequential Monte Carlo (SMC) sampler del2006sequential . In this framework, the task of optimizing an objective function is translated to be how to sample from a sequence of target distributions , and then evaluate the optimum based on the samples. A basic difference between existent PFO methods and PFSOs proposed here is as follows. To evaluate a candidate solution , the former requires a full computation of , which is too expensive for the case considered here, as shown in (1). This is because the computation of requires access to all training data points and computations of all component functions of . In contrast, our methods only deals with a single component function per iteration, which only requires access to a single training data point.

5 Experimental Evaluations

We seek to experimentally validate two claims about our methods. The first claim is that PFSOs proposed here outperform both KF-type and typical IPMs in handling nonlinear models in the context of stochastic optimization. We tested this claim across 2 synthetic nonlinear least-square fitting applications and 6 real-world binary classification applications. The second claim is that PFSOs have the flexibility in handling different types of loss functions with performance guarantee. We tested this claim using 6 real-world binary classification applications. All objective functions to be optimized in our experiments are nonlinear.

5.1 Synthetic data experiments

We considered two nonlinear model fitting experiments, in which the EKF-IPM, UKF-IPM and the typical IPM are involved for performance comparison. The term UKF-IPM is the abbreviation of unscented Kalman filter (UKF) based IPM. It is obtained by substituting the EKF recursion, namely (8)-(9), with an UKF recursion. Readers are referred to julier2004unscented for details about UKF.

5.1.1 Least-square fitting of a sigmoid function

The setting of this experiment is borrowed from aky2018the . The cost function to be minimized is of the form (1) with


where , , , , and . We simulated training data points for use, in which and, given , is set to be , where , denotes a by identity matrix. In this experiment, we set and at 2 and 0.1, respectively. The initial estimate of for IPM is randomly selected from a uniform distribution that centered around . As the model is nonlinear, the typical IPM used here is actually an approximate nonlinear IPM, which applies an iterative numerical solver at each iteration. We set the values of all initial particles for PFSOs identically to be . The EKF-IPM and UKF-IPM are initialized with , where . As all methods involved here adopt the same initial estimate of , a fair comparison of these methods is thus guaranteed. The particle size in PFSOs is set at . Given an estimate of yielded at iteration , the corresponding cost or loss value is defined to be . We use a normalized cost (NC) as the performance metric, defined as .

For each algorithm, we performed 30 independent runs of it and calculated an averaged , , over these runs. The experimental result is plotted in Fig.1. As is shown, KS-PFSO and MCMC-PFSO have a faster initial convergence rate in the first 10 iterations. They perform comparatively with EKF-IPM and UKF-IPM in terms of the final cost. All these KF-type and particle based methods are numerically stable, while the typical IPM suffers from instability in iterations of the final phase. Fig.2 illustrates how the performance of PFSOs changes along with the particle size . It shows that, when achieves 500, the performance of these particle based methods tends to its upper limit. In addition, we see that MCMC-PFSO performs better than KS-PFSO when takes very small values, corresponding to cases wherein the posterior is under-sampled.

Figure 1:

Fitting a sigmoid function using the typical IPM, EKF-IPM, UKF-IPM and the proposed PFSOs. 30 independent runs of each algorithm are conducted. This figure shows the result averaged over these runs. The upper panel shows the evolution of the normalized cost over the iterations. We see that KS-PFSO and MCMC-PFSO perform comparatively with EKF-IPM and UKF-IPM, and significantly better than the typical IPM. The typical IPM suffers from instability in iterations of the final phase. The lower panel depicts the sum of the absolute values of the entries of

per iterations, which implies that KS-PFSO and MCMC-PFSO converge faster than the other methods.

Figure 2: The particle size of the PFSOs vs. the achieved minimum cost. This result is obtained for the experimental case presented in Subsection 5.1.1.

5.1.2 Least-square fitting of a highly nonlinear model

We conducted another least-square fitting experiment to further test the methods involved here. The experiment setting is the same as in subsection 5.1.1 except for the formulation of , which is now defined to be:


where , . For each algorithm, 30 independent runs of it are conducted. The averaged , , over these runs is calculated and plotted in Fig.3. We see that, for this highly nonlinear case, KS-PFSO performs significantly better than the other competitors in terms of minimizing the cost. The typical IPM again suffers from instability in iterations of the final phase.

Figure 3: Fitting a more complex and highly nonlinear function as presented in subsection 5.1.2. 30 independent runs of each algorithm are conducted. This figure shows the result averaged over these runs. The upper panel shows the evolution of the normalized cost over the iterations. The lower panel depicts the sum of the absolute values of the entries of per iterations. We see that KF-PFSO performs best, KS-PFSO and MCMC-PFSO converge faster than the others, the typical IPM suffers again from instability in iterations of the final phase.

5.2 Binary classification using real data sets

We then tested the proposed PFSOs on 6 UCI data sets blake1998uci : Haberman downs2001exact , HTRU2 lyon2016fifty , IRIS, Banknote Authentication, Pima Indians Diabetes, Skin Segmentation. Table 1 gives a summary of these data sets. In our tests, the typical IPM, UKF-IPM, and a typical SGD algorithm termed Adaline bottou1998online are included for reference. We focused on ERM for binary classification. The objective function is (1), where is the label and the feature vector. Except IRIS, each data set consists of two classes of data instances. IRIS has three classes. We covert IRIS into a two-class data set by treating its first two classes as one class, and regarding the last class in the original data set as the other class. We considered two loss functions for use, namely the least-quadratic (LQ) function with where , and the logistic function with . For both cases, we have .

Name of data set Haberman HTRU2 IRIS Banknote Pima Skin
Number of Instances 306 17898 150 1372 768 245057
Number of Attributes 3 9 4 5 8 4
Table 1: Benchmark datsets from UCI
Algorithm Loss function Haberman HTRU2 IRIS Banknote Pima Skin
Adaline LQ 0.2647 0.0899 0.3333 0.3943 0.3490 0.2187
IPM LQ 0.2647 0.0514 0.3333 0.0255 0.3490 0.0600
EKF-IPM LQ 0.2647 0.0514 0.3267 0.0445 0.3490 0.0849
UKF-IPM LQ 0.2647 0.0514 0.3333 0.0452 0.3490 0.0845
KS-PFSO LQ 0.2647 0.0447 0.0933 0.0233 0.3060 0.0592
Logistic 0.2549 0.0363 0.0533 0.0561 0.2708 0.1110
MCMC-PFSO LQ 0.2647 0.0363 0.1000 0.0241 0.3021 0.0726
Logistic 0.2582 0.0397 0.0533 0.0437 0.2839 0.1098
Table 2: Error rate results on 6 benchmark data sets from UCI. The best result given by these algorithms is indicated with boldface font.
Scaled time 1 2.6686 2.6544 21.7107
Table 3: Scaled computing time comparison for the algorithms involved. This experiment is conducted with a one-core Intel i5-3210M 2.50 GHz processor. The particle size for all PFSOs is set at 1000.

For algorithm performance evaluation, we use 10-fold cross-validation. Specifically speaking, we split a data set randomly into 10 sub data sets, figure out the through ERM using 9 sub data sets of them, reserving the remaining one sub data set as a test set. This process is repeated 10 times, every sub data set having one chance to act as the test set. Given the optimum that has been found denoted by , we predict the label for a data instance denoted by checking the value of . If this value is bigger than 0.5, then we set the predicted label to 1; otherwise to -1. By comparing the predicted labels with the true labels for data items in the test set, we obtain the error rate, which we use here as the performance metric for algorithm comparison.

For every data set, we do the same initialization for each algorithm. The regularization parameter is set to 0.25. The particle size for PFSOs is set at . All the other parameters are initialized in the same way as that shown in Subsection 5.1. The learning rate parameter in Adaline is set to , where denotes the iteration index.

Table 2 presents the experimental result. It shows that, for every data set, the best classification result in terms of error rate is always given by KS-PFSO. For datasets Haberman, HTRU2, IRIS and Pima, the logistic loss function leads to better performance than the LQ type loss function in terms of error rate. For the other two dataset, namely Banknote and Skin, the LQ type loss function is preferable to the logistic loss function for use with KS-PFSO. In addition, we can see that the performance of UKF-IPM is indistinguishable from that of EKF-IPM. We also checked the influence of the particle size on the error rate for PFSOs. See the result in Fig.4. Once again we see that MCMC-PFSO outperforms KS-PFSO when takes very small values, i.e., when the posterior is under-sampled.

Figure 4: The particle size of the PFSOs vs. the error rate. This experiment is conducted using the UCI benchmark data set termed Pima-Indians-diabetes. The abbreviations “LQ” and “LG” represent “least-quadratic” and “Logistic”, respectively.

The significant better performance of PFSOs revealed above is obtained at the cost of computing time, see Table 3. However, PFSOs can be markedly accelerated by parallelization. The particle generation and weighting steps are straightforward to parallelize, as they require only independent operations on each particle. The resampling procedure can also be parallelized as reported in murray2016parallel . The Rao-Blackwellization technique provides another way to accelerate PF methods if there are analytic structures present in the model liu1998sequential ; sarkka2007rao .

6 Concluding Remarks

Motivated by the success and limitations of the KF-type IPMs (see Section 2), we developed two stochastic optimization algorithms, namely KS-PFSO and MCMC-PFSO, for large-scale stochastic optimization. The PF methods are not new, since they have been developed for more than 20 years since the seminal paper gordon1993novel , while the PFSO methods presented here are new since they are distinctive from all existing stochastic optimization methods. The superiorities of our methods over existent ones are empirically demonstrated using both synthetic and real-life data sets. To summarize, our methods have four desirable properties: (a) they free users from computing gradients and remove the requirement that the objective function be differentiable; (b) they have no requirement on the form of the component functions of ; (c)they markedly outperform existent methods in terms of convergence rate and accuracy; (d) their performance can be easily tuned by adapting their particle size. Currently, this paper lacks a strict theoretical analysis for PFSOs. A practical way to do this is to take into account of theories on PF or Sequential Monte Carlo methods in e.g., crisan2002survey ; sarkka2013bayesian ; douc2014long ; hu2008basic ; hu2011general .

We argue that our work here opens the door to borrow a rich body of PF methods to solve large-scale stochastic optimization problems. For example, we can borrow ideas from adaptive importance sampling, e.g., in cappe2008adaptive ; oh1992adaptive ; liu2014adaptive , to design adaptive PFSOs that can build up proposal functions automatically. Our recent work on robust PF liu2017ilapf ; liu2017robust ; dai2016robust ; liu2011instantaneous ; liu2019robust may provide tools for developing robust PFSOs.


  • (1) S.J. Reddi, New Optimization Methods for Modern Machine Learning, Ph.D. thesis, Carnegie Mellon University, 2017.
  • (2) L. Bottou, F.E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
  • (3) L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010, pp. 177–186. Springer, 2010.
  • (4) L. Bottou, “Online learning and stochastic approximations,”

    On-line learning in neural networks

    , vol. 17, no. 9–42, pp. 142, 1998.
  • (5) J. Duchi, E. Hazan, and Y. Singer,

    Adaptive subgradient methods for online learning and stochastic optimization,”

    Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
  • (6) T. Schaul, S. Zhang, and Y. LeCun, “No more pesky learning rates,” in International Conference on Machine Learning (ICML), 2013, pp. 343–351.
  • (7) M. Mahsereci and P. Hennig, “Probabilistic line searches for stochastic optimization,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 181–189.
  • (8) N. Parikh and S. Boyd, “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
  • (9) D.P. Bertsekas, “Incremental proximal methods for large scale convex optimization,” Mathematical programming, vol. 129, no. 2, pp. 163, 2011.
  • (10) Ö.D. Akyıldız, V. Elvira, and J. Míguez, “The incremental proximal method: A probabilistic perspective,” in Proc. of IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2018, pp. 4279–4283.
  • (11) Simon S Haykin and Simon S Haykin, Kalman filtering and neural networks, Wiley Online Library, 2001.
  • (12) M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.
  • (13) R. Van Der Merwe, A. Doucet, N. De Freitas, and E.A. Wan, “The unscented particle filter,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 584–590.
  • (14) B. Liu, X. Ma, and C. Hou, “A particle filter using SVD based sampling Kalman filter to obtain the proposal distribution,” in Proc. of IEEE Conf. on Cybernetics and Intelligent Systems. IEEE, 2008, pp. 581–584.
  • (15) J. Liu and M. West, “Combined parameter and state estimation in simulation-based filtering,” in Sequential Monte Carlo methods in practice, pp. 197–223. Springer, 2001.
  • (16) N. Gordon, D. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.
  • (17) B. Liu, C. Ji, X. Ma, and C. Hou, “Single-tone frequency tracking using a particle filter with improvement strategies,” in Int’l Conf. on Audio, Language and Image Processing (ICALIP). IEEE, 2008, pp. 1615–1619.
  • (18) W. R. Gilks and C. Berzuini, “Following a moving target–Monte Carlo inference for dynamic Bayesian models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 63, no. 1, pp. 127–146, 2001.
  • (19) N. Chopin, “A sequential particle filter method for static models,” Biometrika, vol. 89, no. 3, pp. 539–552, 2002.
  • (20) G. Ridgeway and D. Madigan, “A sequential Monte Carlo method for Bayesian analysis of massive datasets,” Data Mining and Knowledge Discovery, vol. 7, no. 3, pp. 301–319, 2003.
  • (21) V. Patel, “Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning,” SIAM Journal on Optimization, vol. 26, no. 4, pp. 2620–2648, 2016.
  • (22) B.M. Bell and F.W. Cathey, “The iterated kalman filter update as a gauss-newton method,” IEEE Transactions on Automatic Control, vol. 38, no. 2, pp. 294–297, 1993.
  • (23) D.P. Bertsekas, “Incremental least squares methods and the extended kalman filter,” SIAM Journal on Optimization, vol. 6, no. 3, pp. 807–822, 1996.
  • (24) Y.C. Ho, “On the stochastic approximation method and optimal filtering theory,” Journal of mathematical analysis and applications, vol. 6, pp. 152–154, 1962.
  • (25) P. Stinis, “Stochastic global optimization as a filtering problem,” Journal of Computational Physics, vol. 231, no. 4, pp. 2002–2014, 2012.
  • (26) B. Liu, “Posterior exploration based sequential Monte Carlo for global optimization,” Journal of Global Optimization, vol. 69, no. 4, pp. 847–868, 2017.
  • (27) B. Liu, S. Cheng, and Y. Shi, “Particle filter optimization: A brief introduction,” in Int’l Conf. in Swarm Intelligence. Springer, 2016, pp. 95–104.
  • (28) P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo samplers,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 3, pp. 411–436, 2006.
  • (29) S.J. Julier and J.K. Uhlmann, “Unscented filtering and nonlinear estimation,” Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, 2004.
  • (30) C.L. Blake, “Uci repository of machine learning databases,”, 1998.
  • (31) T. Downs, K.E. Gates, and A. Masters, “Exact simplification of support vector solutions,” Journal of Machine Learning Research, vol. 2, pp. 293–297, 2001.
  • (32) R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, and J. D. Knowles, “Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach,” Monthly Notices of the Royal Astronomical Society, vol. 459, no. 1, pp. 1104–1123, 2016.
  • (33) L.M. Murray, A. Lee, and P.E. Jacob, “Parallel resampling in the particle filter,” Journal of Computational and Graphical Statistics, vol. 25, no. 3, pp. 789–805, 2016.
  • (34) J.S. Liu and R. Chen, “Sequential Monte Carlo methods for dynamic systems,” Journal of the American statistical association, vol. 93, no. 443, pp. 1032–1044, 1998.
  • (35) S. Särkkä, A. Vehtari, and J. Lampinen, “Rao-blackwellized particle filter for multiple target tracking,” Information Fusion, vol. 8, no. 1, pp. 2–15, 2007.
  • (36) D. Crisan and A. Doucet, “A survey of convergence results on particle filtering methods for practitioners,” IEEE Transactions on signal processing, vol. 50, no. 3, pp. 736–746, 2002.
  • (37) S. Särkkä, Bayesian filtering and smoothing, vol. 3, Cambridge University Press, 2013.
  • (38) R. Douc, E. Moulines, and J. Olsson, “Long-term stability of sequential monte carlo methods under verifiable conditions,”

    The Annals of Applied Probability

    , vol. 24, no. 5, pp. 1767–1802, 2014.
  • (39) X. Hu, T. Schön, and L. Ljung, “A basic convergence result for particle filtering,” IEEE Transactions on Signal Processing, vol. 56, no. 4, pp. 1337–1348, 2008.
  • (40) X. Hu, T. Schön, and L. Ljung, “A general convergence result for particle filtering,” IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3424–3429, 2011.
  • (41) O. Cappé, R. Douc, A. Guillin, J. Marin, and C.P. Robert, “Adaptive importance sampling in general mixture classes,” Statistics and Computing, vol. 18, no. 4, pp. 447–459, 2008.
  • (42) M. Oh and J.O. Berger, “Adaptive importance sampling in Monte Carlo integration,” Journal of Statistical Computation and Simulation, vol. 41, no. 3-4, pp. 143–168, 1992.
  • (43) B. Liu, “Adaptive annealed importance sampling for multimodal posterior exploration and model selection with application to extrasolar planet detection,” The Astrophysical Journal Supplement Series, vol. 213, no. 14, pp. 1–16, 2014.
  • (44) B. Liu, “ILAPF: Incremental learning assisted particle filtering,” in Proc. of IEEE Int’l Conf. on Acoustics, Speech and Signal Processing (ICASSP 2018), 2018, pp. 4284–4288.
  • (45) B. Liu, “Robust particle filter by dynamic averaging of multiple noise models,” in Proc. of the 42nd IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2017, pp. 4034–4038.
  • (46) Y. Dai and B. Liu, “Robust video object tracking via Bayesian model averaging-based feature fusion,” Optical Engineering, vol. 55, no. 8, pp. 1–11, 2016.
  • (47) B. Liu, “Instantaneous frequency tracking under model uncertainty via dynamic model averaging and particle filtering,” IEEE Trans. on Wireless Communications, vol. 10, no. 6, pp. 1810–1819, 2011.
  • (48) B. Liu,

    “Robust particle filtering via Bayesian nonparametric outlier modeling,”

    in International Conference on Information Fusion (FUSION), 2019.