Parallel sequential Monte Carlo for stochastic optimization

11/23/2018 · by Ömer Deniz Akyıldız, et al.

We propose a parallel sequential Monte Carlo optimization method to minimize cost functions which are computed as the sum of many component functions. The proposed scheme is a stochastic zeroth order optimization algorithm which uses only evaluations of small subsets of component functions to collect information from the problem. The algorithm consists of a bank of samplers and generates particle approximations of several sequences of probability measures. These measures are constructed in such a way that they have associated probability density functions whose global maxima coincide with the global minima of the cost function. The algorithm selects the best performing sampler and uses it to approximate a global minimum of the cost function. We prove analytically that the resulting estimator converges to a global minimum of the cost function almost surely as the number of Monte Carlo samples tends to infinity. We show that the algorithm can tackle cost functions with multiple minima or with wide flat regions.


1 Introduction

In signal processing and machine learning, optimization problems of the form

$$\theta^\star = \operatorname*{arg\,min}_{\theta \in \Theta} f(\theta), \quad \text{where} \quad f(\theta) = \sum_{i=1}^{n} f_i(\theta), \qquad (1)$$

where $\Theta \subset \mathbb{R}^d$ is the $d$-dimensional search space, have attracted significant attention in recent years for problems where $n$ is very large. Such problems often arise in big data settings, e.g., when one needs to estimate parameters given a large number of observations. Because of their efficiency, the optimization community has focused mainly on stochastic gradient based methods [1, 2, 3] (see [4] for a recent review of the field), where an estimate of the gradient is obtained using a randomly selected subsample of the gradients of the component functions (the $f_i$ in (1)) at each iteration. The resulting estimate is then used to perform a stochastic descent step. The majority of these stochastic gradient methods construct the subsamples by sampling with replacement, which yields unbiased estimates of the gradient; the estimate can then be seen as a noisy gradient with additive, zero-mean noise. In practice, however, there are schemes that subsample the data set without replacement (hence producing biased gradient estimators), and it has been argued that such methods can attain better numerical performance [5, 6].

Gradient information may not always be available, however, for a variety of reasons. For example, in an engineering application, the system to be optimized might be a black box, e.g., a piece of closed software code with free parameters, which can be evaluated but cannot be differentiated [7]. In such cases, one needs a gradient-free optimization scheme, meaning that the scheme must rely only on function evaluations rather than on any sort of actual gradient information. Classical gradient-free optimization methods have attracted significant interest over the past decades [8, 9]. These methods proceed either by random search, which evaluates the cost function at random points and updates the parameter whenever a decrease in the function value is achieved [8], or by constructing a numerical (finite-difference type) approximation of the gradient that can be used to take a descent step [7].

These methods are not applicable, however, if one can only obtain noisy function evaluations or can only evaluate certain subsets of component functions in a problem like (1). In this case, since the function evaluations are not exact, random search methods cannot be used reliably. To address this problem, a number of gradient-free stochastic optimization methods have been proposed in recent years, see, e.g., [10, 11, 12, 13]. Similar to the classical case, these methods use noisy function evaluations to construct a finite-difference type approximation of the gradient. However, when the cost function has multiple minima or regions where the gradients are zero, these methods may suffer from poor numerical performance. In particular, the optimizer can easily get stuck in a local minimum, due to its reliance on gradient approximations. Moreover, when the gradient contains little information about any minimum (e.g., in flat regions), gradient-free stochastic optimizers (as well as exact-gradient schemes) can suffer from slow convergence.

An alternative to constructing a numerical approximation of the gradient is to build a probability measure endowed with a probability density function (pdf) whose maxima coincide with the minima of the cost function. In this way, the optimization problem can be recast as an inference problem: one can resort to sampling techniques to approximate the probability measure and then estimate the maxima of its pdf. This approach has the advantage of enabling the reconstruction of multiple minima and finding a global minimum, since the probability measure is matched to the cost function. Methods of this nature have long been considered in the literature, including simulated annealing [14], Monte Carlo expectation maximization [15], Markov chain Monte Carlo (MCMC) based methods [16], and methods using sequential Monte Carlo (SMC) [17, 18, 19]. These methods have been restricted to the case where one can use exact function evaluations to assess the quality of each sample. The stochastic setting, where it is only possible to compute noisy evaluations of $f$, has also received some attention, see, e.g., [20] for a survey. However, the methods reviewed in [20] contain gradient-based Monte Carlo estimators and address a different class of stochastic optimization problems, where the cost function itself is defined as an expectation rather than a finite sum as in (1). In recent years, extensions of MCMC-based sampling methods have been developed in order to sample from pdfs whose maxima match the minima of a function of the form (1), see, e.g., [21, 22]. However, these schemes rely on the computation of noisy gradients, which we herein assume is not possible. There are also other methods using MCMC (see, e.g., [23], which employs noisy Metropolis steps) that do not require gradients. However, these techniques are primarily designed as sampling algorithms rather than optimization methods. A perspective closer to our own approach was taken in [24], where an adaptive importance sampler was developed using subsampling to compute biased weights. However, the method in [24] lacks convergence guarantees.

In this paper, we propose a parallel sequential Monte Carlo optimizer (PSMCO) to minimize cost functions with a finite-sum structure. The PSMCO is a zeroth-order stochastic optimization algorithm, in the sense that it only uses evaluations of small batches of individual components in (1). The proposed scheme proceeds by constructing parallel samplers, each of which aims at minimizing (1). Each sampler performs subsampling without replacement to obtain its mini-batches of individual components and passes over the dataset only once. Using these mini-batches, the PSMCO constructs potential functions, propagates samples via a jittering scheme [25], and selects samples by applying a weighting-resampling procedure. Communication between the parallel samplers is only necessary when an estimate of the minimum is required. In this case, the best performing sampler is selected and the minimum is estimated. We analytically prove that the estimate provided by each sampler converges almost surely to a global minimum of the cost function as the number of Monte Carlo samples at each sampler tends to infinity. We then provide numerical results for two optimization problems where classical stochastic optimization methods struggle to perform. We remark on the difference between the proposed scheme and the SMC-based schemes in [17, 19], where the authors partitioned the parameter vector and modeled it as a dynamical system, which is suitable for many global optimization problems [26]. In contrast, we aim at estimating the full parameter vector at each iteration.

The paper is organized as follows. After a brief Notation section, we lay out the relationship between Bayesian inference and optimization in Sec. 2. Then, based on Sec. 2, we develop a sequential Monte Carlo scheme in Sec. 3. In Sec. 4, we analyze this scheme and investigate its theoretical properties. Then, we present experimental results in Sec. 5 and make some concluding remarks in Sec. 6.

Notation

For $n \in \mathbb{N}$, we denote $[n] = \{1, \dots, n\}$. The space of bounded functions on the parameter space $\Theta$ is denoted as $B(\Theta)$. The family of Borel subsets of $\Theta$ is denoted with $\mathcal{B}(\Theta)$. The set of probability measures on the measurable space $(\Theta, \mathcal{B}(\Theta))$ is denoted $\mathcal{P}(\Theta)$. Given $\varphi \in B(\Theta)$ and $\pi \in \mathcal{P}(\Theta)$, the integral of $\varphi$ with respect to (w.r.t.) $\pi$ is written as

$$(\varphi, \pi) = \int_\Theta \varphi(\theta)\, \pi(d\theta).$$

Given a Markov kernel $\kappa(d\theta' \,|\, \theta)$ on $\Theta$, we denote $\kappa\varphi(\theta) = \int_\Theta \varphi(\theta')\, \kappa(d\theta' \,|\, \theta)$. If $\varphi \in B(\Theta)$, then $\|\varphi\|_\infty = \sup_{\theta \in \Theta} |\varphi(\theta)| < \infty$.

2 Optimization as Bayesian inference

In this section, we describe how to construct a sequence of probability measures that can be linked to the solution of problem (1). Let $\pi_0 \in \mathcal{P}(\Theta)$ be the initial element of the sequence. We construct the rest of the sequence recursively as

$$\pi_t(d\theta) = \frac{G_t(\theta)\, \pi_{t-1}(d\theta)}{\int_\Theta G_t(\theta')\, \pi_{t-1}(d\theta')}, \qquad t \ge 1, \qquad (2)$$

where $G_t : \Theta \to (0, \infty)$ are potential functions [27]. The key idea is to associate these potentials with mini-batches of individual components of the cost function (subsets of the $f_i$'s) in order to construct a sequence of measures such that (for a prescribed value of $T$) the global maxima of the density of $\pi_T$ coincide with the global minima of the cost function $f$. We remark that the measures $\pi_t$ are all absolutely continuous with respect to $\pi_0$ if the potential functions $G_t$, $t \ge 1$, are bounded.

To construct the potentials, we use mini-batches consisting of $K$ individual functions for each iteration $t$. To be specific, we randomly select subsets of indices $\mathcal{I}_t \subset [n]$, $t = 1, \dots, T$, by drawing uniformly from $[n]$ without replacement. Each subset has $|\mathcal{I}_t| = K$ elements, in such a way that we obtain $T = n/K$ subsets satisfying $\bigcup_{t=1}^{T} \mathcal{I}_t = [n]$ and $\mathcal{I}_t \cap \mathcal{I}_s = \emptyset$ when $t \neq s$. Finally, we define the potential functions as

$$G_t(\theta) = \exp\Bigl\{ -\sum_{i \in \mathcal{I}_t} f_i(\theta) \Bigr\}. \qquad (3)$$
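To make the construction concrete, the following Python sketch draws the disjoint index sets $\mathcal{I}_1, \dots, \mathcal{I}_T$ and evaluates the potential (3) on a mini-batch. The helper names and the toy quadratic components are our own illustrative choices, not part of the original method.

```python
import numpy as np

def make_minibatches(n, K, rng):
    """Split {0, ..., n-1} into T = n // K disjoint mini-batches,
    drawn uniformly without replacement (a single pass over the data)."""
    perm = rng.permutation(n)
    return [perm[t * K:(t + 1) * K] for t in range(n // K)]

def potential(theta, batch, f_components):
    """Evaluate G_t(theta) = exp(-sum_{i in I_t} f_i(theta)), as in (3)."""
    return np.exp(-sum(f_components[i](theta) for i in batch))

# Toy example with quadratic components f_i(theta) = 0.5 * (theta - y_i)^2.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
f_components = [lambda th, yi=yi: 0.5 * (th - yi) ** 2 for yi in y]
batches = make_minibatches(n=100, K=10, rng=rng)
print(potential(0.0, batches[0], f_components))
```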

The following result establishes a precise connection between the optimization problem in (1) and the sequence of probability measures defined in (2).

Proposition 1.

Assume that the potentials are selected as in (3) for $t = 1, \dots, T$, with $\bigcup_{t=1}^{T} \mathcal{I}_t = [n]$ and $\mathcal{I}_t \cap \mathcal{I}_s = \emptyset$ for $t \neq s$. Let $\pi_T$ be the $T$-th probability measure constructed by means of recursion (2). Then we have

$$\frac{d\pi_T}{d\pi_0}(\theta) \propto \exp\{ -f(\theta) \},$$

where $\frac{d\pi_T}{d\pi_0}$ denotes the Radon–Nikodym derivative of $\pi_T$ with respect to the prior measure $\pi_0$.

Proof.

See the Appendix. ∎

Remark 1.

Notice that when $\pi_0$ is a uniform probability measure on $\Theta$, we simply have

$$\pi_T(\theta) \propto \exp\{ -f(\theta) \},$$

where $\pi_T(\theta)$ denotes the pdf (w.r.t. the Lebesgue measure) of the measure $\pi_T(d\theta)$. For conciseness, we abuse notation and use $\pi_t(\theta)$, $\theta \in \Theta$, to indicate the pdf associated to a probability measure $\pi_t$. The two objects can be distinguished clearly by the context (e.g., for an integral $(\varphi, \pi_t)$, necessarily $\pi_t$ is a measure) but also by their arguments. The probability measure takes arguments $d\theta$ or $A \in \mathcal{B}(\Theta)$, while the pdf is a function on $\Theta$.

Therefore, if we can construct the sequence described by (2), then we can replace the minimization of $f$ in (1) by the maximization of the pdf $\pi_T(\theta)$. This relationship was exploited in a Gaussian setting in [28], i.e., the special case of a Gaussian prior and log-quadratic potentials (Gaussian likelihoods), which makes it possible to implement recursion (2) analytically. The solution of this special case can be shown to match a well-known stochastic optimization algorithm, the incremental proximal method [29], with a variable metric. However, for general priors and potentials, it is not possible to construct (2) analytically and maximize $\pi_T(\theta)$. For this reason, we propose a simulation method to approximate the recursion (2) and solve the optimization problem (1).
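To illustrate why the Gaussian special case is tractable, the sketch below implements one step of recursion (2) with a Gaussian prior and a log-quadratic potential in one dimension. This is a minimal conjugate-update illustration of ours, not the variable-metric method of [28].

```python
def gaussian_step(mu, var, y, r):
    """One step of recursion (2) when pi_{t-1} = N(mu, var) and
    G_t(theta) = exp(-(theta - y)^2 / (2 * r)). The product of two
    Gaussian factors in theta is Gaussian, so pi_t = N(mu_new, var_new)
    is available in closed form (standard conjugate update)."""
    var_new = 1.0 / (1.0 / var + 1.0 / r)
    mu_new = var_new * (mu / var + y / r)
    return mu_new, var_new

mu, var = 0.0, 10.0              # prior pi_0
for y in [1.2, 0.8, 1.1]:        # mini-batch terms absorbed one at a time
    mu, var = gaussian_step(mu, var, y, r=1.0)
print(mu, var)                   # the mean of pi_T tracks the minimizer
```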

3 The Algorithm

In this section, we first describe a sampler to simulate from the distributions defined by recursion (2). We then describe an algorithm which runs these samplers in parallel. The parallelization here is not primarily motivated by computational gain (although it can be substantial). We have found that non-interacting parallel samplers are able to keep track of multiple minima better than a single “big” sampler. For this reason, we will not focus on demonstrating computational gains in the experimental section. Rather, we will discuss what parallelization brings in terms of providing better estimates.

We consider $M$ workers (corresponding to $M$ samplers). Specifically, each worker sees a different configuration of the dataset, i.e., the $m$-th worker constructs a distinct sequence of index sets $\mathcal{I}_t^{(m)}$, $t = 1, \dots, T$, which determine the mini-batches sampled from the full set of individual components. Having obtained different, randomly constructed mini-batches, each worker then constructs different potentials $G_t^{(m)}$, $t = 1, \dots, T$, as described in the previous section.

Workers, therefore, aim at estimating their own sequences of probability measures $(\pi_t^{(m)})_{t \ge 1}$ for $m = 1, \dots, M$. We denote the particle approximation of the posterior $\pi_t^{(m)}$ at time $t$ as

$$\pi_t^{m,N}(d\theta) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_t^{(i,m)}}(d\theta).$$

Overall, the algorithm retains $M$ probability distributions. Note that these distributions are different for each $m$ as they depend on different potentials.

1: Sample $\theta_0^{(i)} \sim \pi_0$ for $i = 1, \dots, N$.
2: for $t = 1, \dots, T$ do
3:      Jitter by generating samples $\hat\theta_t^{(i)} \sim \kappa(\cdot \,|\, \theta_{t-1}^{(i)})$, $i = 1, \dots, N$.
4:      Compute weights, $w_t^{(i)} \propto G_t(\hat\theta_t^{(i)})$, $i = 1, \dots, N$.
5:      Resample by drawing $N$ i.i.d. samples, $\theta_t^{(i)} \sim \sum_{j=1}^{N} w_t^{(j)} \delta_{\hat\theta_t^{(j)}}(d\theta)$.
6: end for
Algorithm 1 Sampler on a local node

One iteration of the algorithm on a local node can be described as follows. Assume we collect a probability measure $\pi_{t-1}^{m,N}$ from worker $m$, with the particle system $\{\theta_{t-1}^{(i,m)}\}_{i=1}^{N}$. First, we use a jittering kernel $\kappa(d\theta' \,|\, \theta)$, which is a Markov kernel on $\Theta$, to modify the particles [25] (see Section 3.1 for the precise definition of $\kappa$). The idea is to jitter a subset of the particles in order to modify and propagate them into better regions of $\Theta$, with higher probability density and lower cost. The particles are jittered by sampling,

$$\hat\theta_t^{(i,m)} \sim \kappa(\cdot \,|\, \theta_{t-1}^{(i,m)}), \qquad i = 1, \dots, N.$$

Note that the jittering kernel may be designed so that it only modifies a subset of the particles (again, see Section 3.1 for details). Next, we compute weights for the new set of particles according to the $t$-th potential, namely

$$w_t^{(i,m)} = \frac{G_t^{(m)}(\hat\theta_t^{(i,m)})}{\sum_{j=1}^{N} G_t^{(m)}(\hat\theta_t^{(j,m)})}, \qquad i = 1, \dots, N.$$

After obtaining the weights, each worker performs a resampling step where, for $i = 1, \dots, N$, we set $\theta_t^{(i,m)} = \hat\theta_t^{(j,m)}$ with probability $w_t^{(j,m)}$, $j = 1, \dots, N$. The procedure just described corresponds to a simple multinomial resampling scheme, but other standard methods can be applied as well [30]. We denote the resulting probability measure constructed at the $t$-th iteration of the $m$-th worker as

$$\pi_t^{m,N}(d\theta) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_t^{(i,m)}}(d\theta).$$

The full procedure for the $m$-th worker is outlined in Algorithm 1. In Section 3.1, we elaborate on the selection of the jittering kernel, and in Section 3.2 we detail the scheme for estimating a global minimum of $f$ from the set of random measures $\{\pi_t^{m,N}\}_{m=1}^{M}$.
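A minimal NumPy sketch of one worker (Algorithm 1) follows. It assumes the `make_minibatches` and `potential` helpers from the sketch in Sec. 2 and the `jitter` helper defined in the sketch of Sec. 3.1 below; computing the weights in the log domain is a standard implementation detail of ours, not part of the algorithm's specification.

```python
import numpy as np

def local_sampler(f_components, sample_prior, N, K, sigma_j, rng):
    """One PSMCO worker: jitter, weight, resample (Algorithm 1)."""
    n = len(f_components)
    theta = sample_prior(N, rng)          # (N, d) array of draws from pi_0
    eps_N = 1.0 / np.sqrt(N)              # jittering probability, cf. (4)
    for batch in make_minibatches(n, K, rng):
        theta_hat = jitter(theta, eps_N, sigma_j, rng)
        # log G_t from (3); subtracting the max avoids overflow.
        log_g = np.array([-sum(f_components[i](th) for i in batch)
                          for th in theta_hat])
        w = np.exp(log_g - log_g.max())
        w /= w.sum()
        # Multinomial resampling.
        theta = theta_hat[rng.choice(N, size=N, p=w)]
    return theta
```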

3.1 Jittering kernel

The jittering kernel constitutes one of the key design choices of the proposed algorithm. Following [25], we put the following assumption on the kernel $\kappa$.

Assumption 1.

The Markov kernel $\kappa$ satisfies

$$\sup_{\theta \in \Theta} \int_\Theta |\varphi(\theta') - \varphi(\theta)| \, \kappa(d\theta' \,|\, \theta) \le \frac{c_\kappa \|\varphi\|_\infty}{\sqrt{N}}$$

for any $\varphi \in B(\Theta)$ and some constant $c_\kappa > 0$ independent of $N$.

In this paper, we use kernels of the form

$$\kappa(d\theta' \,|\, \theta) = (1 - \epsilon_N)\, \delta_\theta(d\theta') + \epsilon_N\, \tau(d\theta' \,|\, \theta), \qquad (4)$$

where $\epsilon_N \le \frac{1}{\sqrt{N}}$, which satisfy Assumption 1 [25]. The kernel $\tau$ can be rather simple, such as a multivariate Gaussian or multivariate-t distribution centered around $\theta$. Other choices of $\tau$ are possible as well.
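A sketch of sampling from a kernel of the form (4), with a Gaussian choice of $\tau$; the scale parameter `sigma_j` is a stand-in for whatever variance the design calls for.

```python
import numpy as np

def jitter(theta, eps_N, sigma_j, rng):
    """Draw from kappa in (4): each particle stays put with probability
    1 - eps_N, and otherwise moves by a Gaussian perturbation (tau)."""
    theta_new = theta.copy()
    move = rng.random(len(theta)) < eps_N
    theta_new[move] += sigma_j * rng.standard_normal(theta_new[move].shape)
    return theta_new

# Example: jitter N = 500 particles in R^2 with eps_N = 1 / sqrt(N).
rng = np.random.default_rng(1)
particles = rng.normal(size=(500, 2))
jittered = jitter(particles, eps_N=1 / np.sqrt(500), sigma_j=0.1, rng=rng)
```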

3.2 Estimating the global minima of

In order to estimate the global minima of $f$, we first assess the performance of the samplers run by each worker. A typical performance measure is the marginal likelihood estimate resulting from each $\pi_t^{m,N}$. After choosing the worker that has attained the highest marginal likelihood (say, the $m^\star$-th worker), we estimate a minimum of $f$ by selecting the particle that yields the highest estimated density value.

1: Sample $\theta_0^{(i,m)} \sim \pi_0$ for $i = 1, \dots, N$ and $m = 1, \dots, M$.
2: for $t = 1, \dots, T$ do
3:      for $m = 1, \dots, M$ do
4:          Jitter by generating samples $\hat\theta_t^{(i,m)} \sim \kappa(\cdot \,|\, \theta_{t-1}^{(i,m)})$, $i = 1, \dots, N$.
5:          Update the marginal likelihood, $Z_t^{m,N} = Z_{t-1}^{m,N} \times \frac{1}{N} \sum_{i=1}^{N} G_t^{(m)}(\hat\theta_t^{(i,m)})$.
6:          Compute weights, $w_t^{(i,m)} \propto G_t^{(m)}(\hat\theta_t^{(i,m)})$.
7:          Resample $N$ i.i.d. samples, $\theta_t^{(i,m)} \sim \sum_{j=1}^{N} w_t^{(j,m)} \delta_{\hat\theta_t^{(j,m)}}(d\theta)$.
8:      end for
9:      if an estimate of the solution of problem (1) is needed at time $t$ then
10:         Choose $m_t^\star = \operatorname*{arg\,max}_{m \in [M]} Z_t^{m,N}$.
11:         Estimate $\theta_t^{\star,N} = \operatorname*{arg\,max}_{\theta \in \{\theta_t^{(i,m_t^\star)}\}_{i=1}^{N}} p_t^{m_t^\star,N}(\theta)$.
12:     end if
13: end for
Algorithm 2 PSMCO

To be precise, let us start by denoting the incremental marginal likelihood associated to $\pi_t^{(m)}$ and its estimate as $Z_t^{m}$ and $Z_t^{m,N}$, respectively. They can be explicitly obtained by first computing

$$\bar Z_t^{m,N} = \frac{1}{N} \sum_{i=1}^{N} G_t^{(m)}(\hat\theta_t^{(i,m)})$$

and then updating the running products

$$Z_t^{m,N} = \bar Z_t^{m,N} \, Z_{t-1}^{m,N} \qquad \text{and} \qquad Z_t^{m} = \bar Z_t^{m} \, Z_{t-1}^{m}.$$

The quantity $Z_t^{m,N}$ is a local performance index that keeps track of the quality of the $m$-th particle system $\{\theta_t^{(i,m)}\}_{i=1}^{N}$. This means that we can use $Z_t^{m,N}$ to determine the best performing worker. Given the index of the best performing sampler, which is given by

$$m_t^\star = \operatorname*{arg\,max}_{m \in [M]} Z_t^{m,N},$$

we obtain a maximum-a-posteriori (MAP) estimator,

$$\theta_t^{\star,N} = \operatorname*{arg\,max}_{\theta \in \{\theta_t^{(i,m_t^\star)}\}_{i=1}^{N}} p_t^{m_t^\star,N}(\theta), \qquad (5)$$

where $p_t^{m_t^\star,N}$ is the kernel density estimator [31, 32] described in Remark 2. Note that we do not construct the entire density estimator and maximize it; we only evaluate the KDE at the particle locations and keep the best particle. Since this operation is performed locally on the particles from the best performing sampler, it has $\mathcal{O}(N^2)$ cost, where $N$ is the number of particles on a single node, which is much smaller than the total number of particles $MN$. The full procedure is outlined in Algorithm 2.
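The selection step can be sketched as follows: each worker maintains a running log marginal likelihood, and the MAP estimate (5) evaluates a Gaussian particle KDE only at the particle locations of the winning worker, which is the $\mathcal{O}(N^2)$ operation mentioned above. The bandwidth `h` and the helper names are our own assumptions.

```python
import numpy as np

def update_log_Z(log_Z_prev, log_g):
    """Running log marginal likelihood: add log((1/N) sum_i G_t(theta_i)),
    computed stably from the log-potentials log_g."""
    return log_Z_prev + np.logaddexp.reduce(log_g) - np.log(len(log_g))

def map_estimate(particles_per_worker, log_Z, h):
    """Choose the best worker and return its KDE-maximizing particle, (5)."""
    m_star = int(np.argmax(log_Z))            # best performing sampler
    X = particles_per_worker[m_star]          # (N, d) particle array
    # Pairwise squared distances between the N particles: the O(N^2) step.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian KDE evaluated at the particles only; the normalizing
    # constant is dropped since it does not change the argmax.
    kde_at_particles = np.exp(-0.5 * d2 / h ** 2).mean(axis=1)
    return X[int(np.argmax(kde_at_particles))]
```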

Remark 2.

Let $\varphi : \mathbb{R}^d \to [0, \infty)$ be a bounded pdf with zero mean and a finite second order moment. We can use the particle system $\{\theta_t^{(i)}\}_{i=1}^{N}$ and the pdf $\varphi$ to construct the kernel density estimator (KDE) of $\pi_t(\theta)$ as

$$p_t^N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \varphi(\theta - \theta_t^{(i)}), \qquad (6)$$

where $\theta \in \Theta$. Note that $p_t^N$ is not a standard KDE, because the particles $\theta_t^{(i)}$ are not i.i.d. samples from $\pi_t$. Eq. (6), however, suggests that the estimator converges when the approximate measure $\pi_t^N$ does. See [33] for an analysis of particle KDEs.

4 Analysis

In this section, we provide some basic theoretical guarantees for Algorithm 2. In particular, we prove results regarding the sampler on a single worker $m$, which hold for any $m \in [M]$. All proofs are deferred to the Appendix.

When constrained to a single worker $m$, the approximation $\pi_t^{m,N}$ is provably convergent. In particular, we have the following results, which hold for every worker $m = 1, \dots, M$ (we omit the index $m$ below for conciseness).

Theorem 1.

Assume the sequence of potentials $(G_t)_{t \ge 1}$ satisfies $G_t \in B(\Theta)$ and $G_t(\theta) > 0$ for every $\theta \in \Theta$ and $t \ge 1$. If $\kappa$ satisfies Assumption 1, then

$$\bigl\| (\varphi, \pi_t^N) - (\varphi, \pi_t) \bigr\|_p \le \frac{c_t \|\varphi\|_\infty}{\sqrt{N}}$$

for every $t \ge 1$, any $\varphi \in B(\Theta)$ and any $p \ge 1$, where $c_t < \infty$ is a constant independent of $N$.

Proof.

See the Appendix. ∎

Theorem 1 states that the samplers on local nodes converge to their correct probability measures (for each $t$) with the rate $\mathcal{O}(1/\sqrt{N})$, which is standard for Monte Carlo methods. Theorem 1 also enables us to obtain the following almost sure convergence result.

Corollary 1.

Under the assumptions of Theorem 1, we have

$$\bigl| (\varphi, \pi_t^N) - (\varphi, \pi_t) \bigr| \le \frac{U_{t,\delta}}{N^{\frac{1}{2} - \delta}},$$

where $U_{t,\delta}$ is an almost surely finite random variable and $\delta \in (0, \frac{1}{2})$ is an arbitrary constant independent of $N$. As a consequence,

$$\lim_{N \to \infty} (\varphi, \pi_t^N) = (\varphi, \pi_t) \quad \text{almost surely} \qquad (7)$$

for any $\varphi \in B(\Theta)$.

Proof.

See the Appendix. ∎

This result ensures that the random error made by the estimators vanishes as $N \to \infty$. Moreover, it provides us with a rate, $\mathcal{O}(N^{-\frac{1}{2}+\delta})$, since the constant $\delta$ can be chosen arbitrarily small. This result is important since it enables us to analyze the properties of the kernel density estimators constructed using the samples at each sampler. In particular, given a normalized kernel (a pdf) $\varphi$ satisfying the assumptions in Remark 2, we can construct the KDE

$$p_t^N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \varphi(\theta - \theta_t^{(i)}).$$

Since $\varphi$ is bounded, it follows from Theorem 1 that (see [33])

$$\bigl\| p_t^N(\theta) - p_t(\theta) \bigr\|_p \le \frac{c_t \, \tilde c}{\sqrt{N}} \qquad (8)$$

for some constants $c_t, \tilde c$ independent of $N$ and $\theta$. Therefore, for any given $\theta$, we can state the following result (see also Remark 4.4 in [33]).

Theorem 2.

Assume that $G_t \in B(\Theta)$ (for all $t$) and $G_t(\theta) > 0$ for every $\theta \in \Theta$. If $\varphi$ is a zero-mean pdf with $\|\varphi\|_\infty < \infty$ and a finite second order moment, then

$$\lim_{N \to \infty} p_t^N(\theta) = p_t(\theta) \quad \text{almost surely} \qquad (9)$$

for every $\theta \in \Theta$.
Proof.

The proof follows from Eq. (8) and Corollary 1. ∎

Theorem 2 provides a guarantee that the evaluations given by the KDE constructed from the samples in a local node converge to the true density values $p_t(\theta)$, for any $\theta \in \Theta$, as $N \to \infty$. Finally, we can relate the empirical maxima to the true maxima.

Theorem 3.

Let $\theta_t^{\star,N}$ be an estimate of a global maximum of $p_t$ computed as in (5). Then, under the assumptions of Theorem 2,

$$\lim_{N \to \infty} p_t(\theta_t^{\star,N}) = p_t^\star \quad \text{almost surely},$$

where $p_t^\star = \sup_{\theta \in \Theta} p_t(\theta)$.

Proof.

See the proof of Theorem 5.2 in [33]. The argument follows from Theorem 1 and the rate given in Corollary 1. ∎

5 Experimental Results

In this section, we show numerical results for two optimization problems, which are hard to solve with conventional methods. In the first example, we focus on minimizing a function with multiple global minima. The aim of this experiment is to show that the algorithm populates multiple global minima successfully. In the second example, we minimize a challenging cost function for which standard stochastic gradient optimizers struggle.

5.1 Minimization of a function with multiple global minima

In this experiment, we tackle a problem of the form (1) in which the component functions $f_i$ are constructed from Gaussian pdfs whose means are drawn at random. This selection results in a cost function $f$ with four global minima. This cost function is a good model for cost functions arising in many machine learning problems where there are multiple global minima, see, e.g., [34]. Although the number of components $n$ in this experiment is small for a stochastic optimization problem, we note that each $f_i$ models a mini-batch in this scenario, and we choose the mini-batch size $K$ in our algorithm accordingly.
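The exact component functions of this experiment are not reproduced here; the snippet below builds a hypothetical stand-in with the same qualitative structure, namely a finite-sum cost whose global minima sit near four randomly drawn means of Gaussian pdfs.

```python
import numpy as np

rng = np.random.default_rng(0)
means = rng.uniform(-3.0, 3.0, size=(4, 2))   # four randomly placed minima

def f_i(theta):
    """Illustrative component: negative log of an (unnormalized) mixture of
    four isotropic Gaussians. NOT the paper's exact f_i."""
    d2 = ((theta - means) ** 2).sum(axis=-1)   # squared distance to each mean
    return -np.logaddexp.reduce(-0.5 * d2)

# A finite sum of n such components keeps the same four global minima.
n = 100
f = lambda theta: n * f_i(theta)
print(f(means[0]), f(np.zeros(2)))             # lowest values near the means
```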

Figure 1: An illustration of the performance of the proposed algorithm for a cost function with four global minima. (a) The plot of the pdf matched to the cost function. The blue regions indicate low values. It can be seen that there are four global maxima. (b) Samples drawn by the PSMCO at a single time instant. (c) The plot of the samples together with the actual cost function $f$.

In order to run the algorithm, we choose a uniform prior measure $\pi_0$ on the search space $\Theta$. It follows from Proposition 1 that the pdf that matches the cost function can be written as

$$\pi_T(\theta) \propto \exp\{ -f(\theta) \},$$

and it has four global maxima. This pdf is displayed in Fig. 1(a). We run $M$ samplers, each with $N$ particles, yielding a total of $MN$ particles. We choose a Gaussian jittering scheme; specifically, the jittering kernel is defined as

$$\kappa(d\theta' \,|\, \theta) = (1 - \epsilon_N)\, \delta_\theta(d\theta') + \epsilon_N\, \mathcal{N}(d\theta' \,;\, \theta, \sigma_j^2 I), \qquad (10)$$

where $\epsilon_N = \frac{1}{\sqrt{N}}$ and $\sigma_j^2$ denotes the jittering variance.

Some illustrative results can be seen in Fig. 1. To be specific, we have run the $M$ independent samplers and plotted all samples for this experiment (instead of estimating a minimum with the best performing sampler). From Fig. 1(b), one can see that the algorithm populates all maxima with samples. Finally, Fig. 1(c) shows the location of the samples relative to the actual cost function $f$. These plots illustrate how the algorithm populates multiple, distinct global maxima with independent samplers. This implies that different independent samplers can report different global maxima in practice. Note that this is in agreement with the analysis provided in Section 4.

5.2 Minimization of the sigmoid function

In this experiment, we address a problem of the form

$$\min_{\theta \in \Theta} \sum_{i=1}^{n} f_i(\theta), \qquad (11)$$

in which the component functions $f_i$ are defined through the sigmoid function

$$s(x) = \frac{1}{1 + e^{-x}}.$$

Cost functions of the form in Eq. (11) are widely used in nonlinear regression with neural networks in machine learning [35].

In this experiment, we run $M$ parallel samplers with $N$ particles each and use mini-batches of size $K$. The jittering kernel is defined in the same way as in (10), where the Gaussian pdf has a variance chosen as the ratio of the dataset size $n$ to the mini-batch size $K$, i.e., $\sigma_j^2 = n/K$, which yields a rather large variance. (Note that this choice is made for efficient exploration of the global minimum, which is hard to find in this example. A large jittering variance may not be adequate in practice when there are multiple minima close to each other, see, e.g., Section 5.1.) To compute the maximum as described in Eq. (5), we use a Gaussian kernel density estimator.
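As an illustration of a finite-sum cost built from sigmoids (the paper's exact parameterization is not reproduced here), one hypothetical form is least-squares regression through a sigmoid nonlinearity; the flat tails of $s(\cdot)$ produce exactly the kind of wide zero-gradient regions discussed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))                 # inputs x_i
theta_true = np.array([1.5, -2.0])             # assumed ground truth
y = sigmoid(X @ theta_true) + 0.1 * rng.standard_normal(1000)

def f_i(theta, i):
    """Illustrative component: squared error of a sigmoid regressor.
    Far from theta_true, s(.) saturates and the gradient vanishes."""
    return (y[i] - sigmoid(X[i] @ theta)) ** 2

# Noisy (mini-batch) evaluation of the cost, as used by the PSMCO.
batch = rng.choice(1000, size=10, replace=False)
print(sum(f_i(np.zeros(2), i) for i in batch))
```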

Figure 2: (a) The cost function and a snapshot of samples from the 50th iteration of PSMCO, PSGD with bad initialization (blue point) and PSGD with good initialization (black points). (b) The minimization performance of each algorithm. It can be seen that PSMCO first converges to the wide region with low values (blue triangle region) and then jumps to the minimum. This happens because the marginal likelihood estimate of the sampler closest to the minimum dominates after a while; effectively, the samplers communicate only to determine the minimizer, with no other exchange of information.

The results can be seen in Fig. 2. We compare our method with a parallel stochastic gradient descent (PSGD) scheme [36] using $M$ optimizers. We note that, given a particular realization of the dataset (which is an i.i.d. sequence), the cost function landscape can be hard to optimize. One can see from Fig. 2(a) that the cost function (for a particular realization) has broad flat regions which make it difficult to minimize even for gradient-based methods, unless their initialization is sufficiently good. Accordingly, we have run two instances of PSGD with “bad” and “good” initializations.

The bad initial point for PSGD can be seen in Fig. 2(a) (the blue dot). We initialize the $M$ parallel SGD optimizers around this point, each with a small zero-mean Gaussian perturbation. This is a poor initialization because the gradients are nearly zero in this region (the yellow triangle in Fig. 2(a)). We refer to the PSGD algorithm starting from this point as PSGD with B/I, which refers to bad initialization. We also initialize the PSMCO from this region, with Gaussian perturbations around the same point and with the same small variance.

The “good” initialization for PSGD is selected from a better region, namely around a point where the gradient values actually contain useful information about the minimum. We refer to the PSGD algorithm starting from this point as PSGD with G/I.

The results are shown in Fig. 2(b). It can be seen that PSGD with good initialization (G/I) moves towards a better region; however, it gets stuck when the gradients become zero. PSGD with B/I, on the other hand, is unable to move at all, since it is initialized in a region where all gradients are zero (which is true even for the mini-batch observations). PSMCO, in contrast, is able to search the space effectively and find the global minimum, as clearly reflected in Fig. 2(b).

6 Conclusions

We have proposed a parallel sequential Monte Carlo optimizer to minimize challenging cost functions, e.g., with multiple global minima or with wide flat regions. The algorithm uses jittering kernels to propagate samples [25] and particle kernel density estimators to find the minima [33], within a stochastic optimization setup. We have shown that, on a single (local) node, the algorithm is provably convergent. At the global level, we argue that the parallel setting, where each sampler uses a different configuration of the same dataset, can be useful to improve the practical convergence of the algorithm. The numerical performance of our algorithm in difficult scenarios shows that this is a promising direction. In this work, we have focused on challenging but low-dimensional cost functions. We leave the potential application of our scheme to high-dimensional optimization problems as future work. The design of an interacting extension of our method, similar to particle islands [37], could also be useful in more challenging settings.

Acknowledgements

An important part of this work was carried out when Ö. D. A. was visiting Department of Mathematics, Imperial College London. This work was partially supported by Ministerio de Economía y Competitividad of Spain (TEC2015-69868-C2-1-R ADVENTURE), and the regional government of Madrid (program CASICAM-CM S2013/ICE-2845).

Appendix

Proof of Proposition 1. Notice that the Radon–Nikodym derivative of the final measure $\pi_T$ with respect to the prior $\pi_0$ is

$$\frac{d\pi_T}{d\pi_0}(\theta) \propto \prod_{t=1}^{T} G_t(\theta) = \exp\Bigl\{ -\sum_{t=1}^{T} \sum_{i \in \mathcal{I}_t} f_i(\theta) \Bigr\} = \exp\{ -f(\theta) \},$$

where the last equality holds because the index sets $\mathcal{I}_1, \dots, \mathcal{I}_T$ form a partition of $[n]$. From here, it easily follows that maximizing this Radon–Nikodym derivative is equivalent to solving problem (1).

Proof of Theorem 1. For clarity, we drop the worker index $m$ from the notation and write $\pi_t^N$ (this convention holds for all terms involving the index $m$). We proceed by an induction argument. At time $t = 0$, the bound

$$\bigl\| (\varphi, \pi_0^N) - (\varphi, \pi_0) \bigr\|_p \le \frac{c_0 \|\varphi\|_\infty}{\sqrt{N}}$$

is a straightforward consequence of the Marcinkiewicz–Zygmund inequality [38], because the particles $\theta_0^{(i)}$, $i = 1, \dots, N$, are i.i.d. samples from $\pi_0$.

Assume now that, after iteration $t-1$, we have a particle set $\{\theta_{t-1}^{(i)}\}_{i=1}^{N}$ and the empirical measure $\pi_{t-1}^N$, which satisfies

$$\bigl\| (\varphi, \pi_{t-1}^N) - (\varphi, \pi_{t-1}) \bigr\|_p \le \frac{c_{t-1} \|\varphi\|_\infty}{\sqrt{N}}. \qquad (12)$$

We first analyze the error in the jittering step. To this end, we construct the jittered random measure

$$\hat\pi_t^N(d\theta) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\hat\theta_t^{(i)}}(d\theta)$$

and iterate the triangle inequality to obtain

$$\bigl\| (\varphi, \hat\pi_t^N) - (\varphi, \pi_{t-1}) \bigr\|_p \le \bigl\| (\varphi, \pi_{t-1}^N) - (\varphi, \pi_{t-1}) \bigr\|_p + \bigl\| (\varphi, \kappa\pi_{t-1}^N) - (\varphi, \pi_{t-1}^N) \bigr\|_p + \bigl\| (\varphi, \hat\pi_t^N) - (\varphi, \kappa\pi_{t-1}^N) \bigr\|_p, \qquad (13)$$

where

$$(\varphi, \kappa\pi_{t-1}^N) = \frac{1}{N} \sum_{i=1}^{N} \kappa\varphi(\theta_{t-1}^{(i)}) = \frac{1}{N} \sum_{i=1}^{N} \int_\Theta \varphi(\theta')\, \kappa(d\theta' \,|\, \theta_{t-1}^{(i)}).$$

The first term on the right hand side (rhs) of (13) is bounded by the induction hypothesis (12). For the second term, we note that

$$\bigl| (\varphi, \kappa\pi_{t-1}^N) - (\varphi, \pi_{t-1}^N) \bigr| \le \frac{1}{N} \sum_{i=1}^{N} \int_\Theta \bigl| \varphi(\theta') - \varphi(\theta_{t-1}^{(i)}) \bigr| \, \kappa(d\theta' \,|\, \theta_{t-1}^{(i)}) \le \frac{c_\kappa \|\varphi\|_\infty}{\sqrt{N}}, \qquad (14)$$

where the last inequality follows from Assumption 1. The upper bound in (14) is deterministic, so the inequality readily implies that

$$\bigl\| (\varphi, \kappa\pi_{t-1}^N) - (\varphi, \pi_{t-1}^N) \bigr\|_p \le \frac{c_\kappa \|\varphi\|_\infty}{\sqrt{N}}. \qquad (15)$$

For the last term in (13), we let $\mathcal{F}_{t-1}$ be the $\sigma$-algebra generated by the random sequence $\{\theta_{0:t-1}^{(i)}\}_{i=1}^{N}$. Let us first note that

$$\mathbb{E}\bigl[ (\varphi, \hat\pi_t^N) \,\big|\, \mathcal{F}_{t-1} \bigr] = (\varphi, \kappa\pi_{t-1}^N).$$

Therefore, the difference takes the form

$$(\varphi, \hat\pi_t^N) - (\varphi, \kappa\pi_{t-1}^N) = \frac{1}{N} \sum_{i=1}^{N} S_t^{(i)},$$

where $S_t^{(i)} = \varphi(\hat\theta_t^{(i)}) - \kappa\varphi(\theta_{t-1}^{(i)})$, $i = 1, \dots, N$, are zero-mean and independent (conditionally on $\mathcal{F}_{t-1}$) random variables, with $|S_t^{(i)}| \le 2\|\varphi\|_\infty$. Then we readily obtain the bound

$$\mathbb{E}\Bigl[ \bigl| (\varphi, \hat\pi_t^N) - (\varphi, \kappa\pi_{t-1}^N) \bigr|^p \,\Big|\, \mathcal{F}_{t-1} \Bigr] \le \frac{\tilde c^{\,p} \|\varphi\|_\infty^p}{N^{p/2}}, \qquad (16)$$

where the relation (16) follows from the Marcinkiewicz–Zygmund inequality [38] and $\tilde c$ is some constant independent of $N$. Taking the unconditional expectation on both sides of (16) and then computing the $p$-th root yields

$$\bigl\| (\varphi, \hat\pi_t^N) - (\varphi, \kappa\pi_{t-1}^N) \bigr\|_p \le \frac{c' \|\varphi\|_\infty}{\sqrt{N}}, \qquad (17)$$

where $c' < \infty$ is a finite constant independent of $N$. Therefore, taking together (12), (15) and (17), we have established that

$$\bigl\| (\varphi, \hat\pi_t^N) - (\varphi, \pi_{t-1}) \bigr\|_p \le \frac{\hat c_t \|\varphi\|_\infty}{\sqrt{N}}, \qquad (18)$$

where $\hat c_t = c_{t-1} + c_\kappa + c'$ is a finite constant independent of $N$.

Next, we have to bound the error after the weighting step. We recall

$$\tilde\pi_t^N(d\theta) = \sum_{i=1}^{N} w_t^{(i)}\, \delta_{\hat\theta_t^{(i)}}(d\theta),$$

where $\tilde\pi_t^N$ denotes the weighted measure. We first note that