Adaptive Sampling Quasi-Newton Methods for Derivative-Free Stochastic Optimization

10/29/2019
by Raghu Bollapragada, et al.

We consider stochastic zero-order optimization problems, which arise in settings from simulation optimization to reinforcement learning. We propose an adaptive sampling quasi-Newton method where we estimate the gradients of a stochastic function using finite differences within a common random number framework. We employ modified versions of a norm test and an inner product quasi-Newton test to control the sample sizes used in the stochastic approximations. We provide preliminary numerical experiments to illustrate potential performance benefits of the proposed method.


1 Introduction

We consider unconstrained stochastic optimization problems of the form

$$\min_{x \in \mathbb{R}^n} \; f(x) = \mathbb{E}_{\xi}\left[F(x, \xi)\right], \tag{1}$$

where one has access only to a zero-order oracle (i.e., a black-box procedure that outputs realizations of the stochastic function values $F(x, \xi)$ and cannot access explicit estimates of the gradient $\nabla f(x)$). Such stochastic optimization problems arise in a plethora of science and engineering applications, from simulation optimization [4, 9, 11, 15] to reinforcement learning [3, 13, 17]. Several methods have been proposed to solve such “derivative-free” problems [1, 12].

We propose finite-difference stochastic quasi-Newton methods for solving (1) by exploiting common random number (CRN) evaluations of $F(x, \xi)$. The CRN setting allows us to define subsampled gradient estimators

$$\left[g(x, \xi_i)\right]_j = \frac{F(x + \nu e_j, \xi_i) - F(x, \xi_i)}{\nu}, \qquad j = 1, \ldots, n, \tag{2}$$
$$g_{S_k}(x) = \frac{1}{|S_k|} \sum_{i \in S_k} g(x, \xi_i), \tag{3}$$

which employ forward differences for the i.i.d. samples of $\xi$ in the set $S_k$ and whereby the finite-difference parameter $\nu$ needs to account only for numerical errors. This gradient estimation has two sources of error: error due to the finite-difference approximation and error due to the stochastic approximation. The error due to stochastic approximation depends on the number of samples $|S_k|$ used in the estimation. Using too few samples affects the stability of a method that relies on the estimates; using a large number of samples results in computational inefficiency. For settings where gradient information is available, researchers have developed practical tests to adaptively increase the sample sizes used in the stochastic approximations and have supported these tests with global convergence results [5, 6, 7, 16]. In this paper we modify these tests to address the challenges associated with the finite-difference approximation errors, and we demonstrate the resulting method on simple test problems.
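To make the estimators concrete, here is a minimal sketch (not the authors' implementation) of the forward-difference subsampled gradient in (2)–(3). The oracle signature `F(x, seed)`, with an integer seed fixing the random realization of $\xi$, is an assumed way of modeling the CRN setting.

```python
import numpy as np

def fd_gradient(F, x, seeds, nu):
    """Forward-difference subsampled gradient estimator, cf. (2)-(3).

    F     : zero-order oracle; F(x, seed) returns a noisy function value,
            with the integer seed fixing the realization of xi (CRN)
    x     : current iterate (1-D numpy array)
    seeds : seeds for the i.i.d. samples xi_i in the batch S_k
    nu    : finite-difference parameter
    """
    n = x.size
    g = np.zeros(n)
    for seed in seeds:
        f0 = F(x, seed)          # base value, reused for all n coordinates
        for j in range(n):
            e_j = np.zeros(n)
            e_j[j] = nu
            # evaluating the perturbed point under the SAME seed is what
            # makes the finite difference a common-random-number estimate
            g[j] += (F(x + e_j, seed) - f0) / nu
    return g / len(seeds)
```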

2 A Derivative-Free Stochastic Quasi-Newton Algorithm

The update form of a finite-difference, derivative-free stochastic quasi-Newton method is given by

$$x_{k+1} = x_k - \alpha_k H_k g_{S_k}(x_k), \tag{4}$$

where $\alpha_k$ is the steplength, $H_k$ is a positive-definite quasi-Newton matrix, and $g_{S_k}(x_k)$ is the (e.g., forward) finite-difference subsampled (or batch) gradient estimate defined by (3). We propose to control the sample sizes $|S_k|$ over the course of optimization to achieve fast convergence by using two different strategies adapted to the setting where no gradient information is available.

Norm test.

The first test is based on the norm condition [7, 10]:

$$\mathbb{E}_k\left[\left\|g_{S_k}(x_k) - \nabla f(x_k)\right\|^2\right] \le \theta^2 \left\|\nabla f(x_k)\right\|^2, \tag{5}$$

where $\theta > 0$ and $\mathbb{E}_k$ denotes the conditional expectation at $x_k$; this condition is used for controlling the sample sizes in subsampled gradient methods. We note that the finite-difference approximation error in the left-hand side of (5) is nonzero and can be upper bounded (independent of $\xi$) for any function $F(\cdot, \xi)$ with $L$-Lipschitz continuous gradients:

$$\left\|g(x, \xi) - \nabla F(x, \xi)\right\| \le \frac{\sqrt{n} L \nu}{2}; \tag{6}$$

the proof is given in the supplementary material. Therefore, one cannot always satisfy (5); moreover, satisfying (5) might be too restrictive. Instead, we propose a norm condition based on the finite-difference subsampled gradient estimators. That is, we use the condition

$$\mathbb{E}_k\left[\left\|g_{S_k}(x_k) - \nabla f_\nu(x_k)\right\|^2\right] \le \theta^2 \left\|\nabla f_\nu(x_k)\right\|^2, \tag{7}$$

where $\nabla f_\nu(x_k) = \mathbb{E}_\xi\left[g(x_k, \xi)\right]$ denotes the expected finite-difference gradient; (7) corresponds to a norm condition in which the right-hand side of (5) is relaxed. The left-hand side of (7) is difficult to compute but can be bounded by the true variance of individual finite-difference gradient estimators; this results in

$$\frac{\mathbb{E}_\xi\left[\left\|g(x_k, \xi) - \nabla f_\nu(x_k)\right\|^2\right]}{|S_k|} \le \theta^2 \left\|\nabla f_\nu(x_k)\right\|^2. \tag{8}$$

Approximating the true expected gradient and variance with sample gradient and variance estimates, respectively, yields the practical finite-difference norm test:

$$\frac{\operatorname{Var}_{i \in S_k^v}\left(g(x_k, \xi_i)\right)}{|S_k|} \le \theta^2 \left\|g_{S_k}(x_k)\right\|^2, \tag{9}$$

where $S_k^v \subseteq S_k$ is a subset of the current sample (batch) and $\operatorname{Var}$ denotes the sample variance. In our algorithm, we test condition (9); whenever it is not satisfied, we increase $|S_k|$ such that the condition is satisfied.
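The following sketch illustrates how the practical norm test (9) could drive the batch size; the layout of the per-sample gradient matrix `G` and the rule for suggesting a larger batch are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def norm_test(G, batch_size, theta):
    """Practical finite-difference norm test, cf. (9).

    G          : |S_k^v| x n array whose rows are per-sample FD
                 gradients g(x_k, xi_i) from the variance subset S_k^v
    batch_size : current sample size |S_k|
    theta      : test parameter
    Returns (passed, suggested_batch_size).
    """
    g_bar = G.mean(axis=0)                    # subsampled gradient estimate
    sample_var = G.var(axis=0, ddof=1).sum()  # trace of the sample covariance
    rhs = theta**2 * max(np.dot(g_bar, g_bar), 1e-32)  # floor guards g_bar = 0
    if sample_var / batch_size <= rhs:
        return True, batch_size
    # smallest |S_k| for which the estimated variance bound would hold
    return False, int(np.ceil(sample_var / rhs))
```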

Inner product quasi-Newton test.

The norm condition controls the variance in the gradient estimation but does not utilize observed quasi-Newton information to control the sample sizes. Recently, Bollapragada et al. [6] proposed to control the sample sizes used in the gradient estimation by ensuring that the stochastic quasi-Newton directions make an acute angle with the true quasi-Newton direction with high probability. That is,

$$\left(H_k g_{S_k}(x_k)\right)^T H_k \nabla f(x_k) > 0 \tag{10}$$

holds with high probability. This condition can be satisfied in expectation at points that are sufficiently far away from stationary points, that is, for points such that $\left\|\nabla f(x_k)\right\| > \frac{\sqrt{n} L \nu \lambda_1^2}{2 \lambda_n^2}$, where $\lambda_1$ and $\lambda_n$ are the largest and smallest eigenvalues of $H_k$, respectively (see supplementary material). Hence, condition (10) can be satisfied with high probability at points that are sufficiently far from being stationary. To achieve this objective, we must control the variance in the left-hand side of (10). We note that the quantity $H_k \nabla f_\nu(x_k)$ cannot be computed directly; however, it can be approximated by $H_k g_{S_k}(x_k)$. The condition is given as

$$\mathbb{E}_k\left[\left(\left(H_k g_{S_k}(x_k)\right)^T H_k \nabla f_\nu(x_k) - \left\|H_k \nabla f_\nu(x_k)\right\|^2\right)^2\right] \le \theta^2 \left\|H_k \nabla f_\nu(x_k)\right\|^4, \tag{11}$$

where $\theta > 0$ is the test parameter. The left-hand side of (11) can be bounded by the true variance of the individual estimators, as done before. Therefore, the following condition is sufficient for ensuring that (11) holds:

$$\frac{\mathbb{E}_\xi\left[\left(\left(H_k g(x_k, \xi)\right)^T H_k \nabla f_\nu(x_k) - \left\|H_k \nabla f_\nu(x_k)\right\|^2\right)^2\right]}{|S_k|} \le \theta^2 \left\|H_k \nabla f_\nu(x_k)\right\|^4. \tag{12}$$

Approximating the true expected gradient and variance with sample gradient and variance estimates results in the practical finite-difference inner product quasi-Newton test:

$$\frac{\operatorname{Var}_{i \in S_k^v}\left(\left(H_k g(x_k, \xi_i)\right)^T H_k g_{S_k}(x_k)\right)}{|S_k|} \le \theta^2 \left\|H_k g_{S_k}(x_k)\right\|^4, \tag{13}$$

where $S_k^v \subseteq S_k$ is a subset of the current sample (batch). This variance computation requires only one additional Hessian-vector product (i.e., the product of $H_k$ with $H_k g_{S_k}(x_k)$). In our algorithm, we test condition (13); whenever it is not satisfied, we increase $|S_k|$ to satisfy the condition.
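A corresponding sketch of the practical inner product quasi-Newton test (13), again illustrative rather than the authors' code; the helper `Hv_fn` (applying $H_k$ to a vector, e.g., via an L-BFGS two-loop recursion) is an assumed interface. It uses the identity $(H_k g_i)^T (H_k \bar{g}) = g_i^T (H_k^2 \bar{g})$ for symmetric $H_k$, so only one extra matrix-vector product is needed.

```python
import numpy as np

def ipqn_test(G, Hv_fn, batch_size, theta):
    """Practical finite-difference inner product quasi-Newton test, cf. (13).

    G          : |S_k^v| x n array of per-sample FD gradients g(x_k, xi_i)
    Hv_fn      : callable v -> H_k v (e.g., an L-BFGS two-loop recursion)
    batch_size : current sample size |S_k|
    theta      : test parameter
    """
    g_bar = G.mean(axis=0)
    Hg = Hv_fn(g_bar)        # quasi-Newton direction (up to sign)
    HHg = Hv_fn(Hg)          # the single extra matrix-vector product
    # (H_k g_i)^T (H_k g_bar) = g_i^T (H_k^2 g_bar) since H_k is symmetric
    inner = G @ HHg
    lhs = inner.var(ddof=1) / batch_size
    rhs = theta**2 * np.dot(Hg, Hg)**2        # theta^2 * ||H_k g_bar||^4
    return lhs <= rhs
```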

Finite-difference parameter and steplength selection.

We select the finite-difference parameter $\nu$ by minimizing an upper bound on the error in the gradient approximation. Assuming that the numerical errors in computing $F(x, \xi)$ are uniformly bounded by $\epsilon_f$ yields the parameter value $\nu = 2\sqrt{\epsilon_f / L}$. This finite-difference parameter is analogous to the one derived in [14], which depends on the variance in the stochastic noise; however, in the CRN setting we need to account only for numerical errors.
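For intuition, here is a short derivation under the stated assumptions (numerical error in the computed $F$ bounded by $\epsilon_f$, and $\nabla F(\cdot, \xi)$ $L$-Lipschitz); it is a sketch of one natural selection rule, not necessarily the paper's exact constant. Each forward-difference component error splits into a truncation term and a cancellation term:

$$\left|\frac{F(x + \nu e_j, \xi) - F(x, \xi)}{\nu} - \left[\nabla F(x,\xi)\right]_j\right| \;\le\; \underbrace{\frac{L \nu}{2}}_{\text{truncation}} \;+\; \underbrace{\frac{2 \epsilon_f}{\nu}}_{\text{numerical error}},$$

and setting the derivative of the right-hand side with respect to $\nu$ to zero gives $\nu = 2\sqrt{\epsilon_f / L}$, with minimal bound value $2\sqrt{\epsilon_f L}$.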

We employ a stochastic line search to choose the steplength by using a sufficient decrease condition based on the sampled function values. In particular, we would like to satisfy

$$F_{S_k}(x_k + \alpha_k p_k) \le F_{S_k}(x_k) + c_1 \alpha_k \, p_k^T g_{S_k}(x_k), \tag{14}$$

where $p_k = -H_k g_{S_k}(x_k)$, $F_{S_k}(x) = \frac{1}{|S_k|} \sum_{i \in S_k} F(x, \xi_i)$, and $c_1 > 0$ is a user-specified parameter. We employ a backtracking procedure wherein a trial steplength that does not satisfy (14) is reduced by a fixed fraction $\tau \in (0, 1)$ (i.e., $\alpha_k \leftarrow \tau \alpha_k$). One cannot always satisfy (14), since the quasi-Newton direction may not be a descent direction for the sampled function at $x_k$. Intuitively, at points where the error in the sample gradient estimate dominates the sample gradient itself, measured in terms of the norm induced by $H_k$, this condition may not be satisfied, and the line search fails. In practice, stochastic line search failure is an indication that the method has converged to a neighborhood of a solution for the sampled function $F_{S_k}$, and the solution cannot be further improved. Therefore, one can use the line search failure as an early stopping criterion.
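A minimal backtracking sketch of the sampled sufficient decrease procedure around (14); the default values of `c1` and `tau` and the iteration cap are illustrative assumptions.

```python
import numpy as np

def backtracking_line_search(F_batch, x, p, g, alpha0, c1=1e-4, tau=0.5,
                             max_backtracks=30):
    """Backtracking on the sampled sufficient decrease condition (14).

    F_batch : callable x -> F_{S_k}(x), the average of F(x, xi_i) over
              the fixed batch S_k (same samples at every trial point)
    p       : search direction, p = -H_k g_{S_k}(x_k)
    g       : batch gradient estimate g_{S_k}(x_k)
    Returns (alpha, success); success=False flags a line search failure,
    which can serve as an early stopping criterion.
    """
    f0 = F_batch(x)
    slope = np.dot(p, g)     # negative whenever p is a descent direction
    alpha = alpha0
    for _ in range(max_backtracks):
        if F_batch(x + alpha * p) <= f0 + c1 * alpha * slope:
            return alpha, True
        alpha *= tau         # reduce the trial steplength by a fixed fraction
    return alpha, False
```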

We also note that because of the stochasticity in the function values, a decrease in the stochastic function realizations does not guarantee a decrease in the true function. A conservative strategy to address this issue is to choose the initial trial steplength to be small enough that the increase in function values (when the stochastic approximations are not good) is controlled. Following the strategy proposed in [6], we derive a heuristic that chooses the initial steplength as

$$\alpha_k = \left(1 + \frac{\operatorname{Var}_{i \in S_k^v}\left(g(x_k, \xi_i)\right)}{|S_k| \left\|g_{S_k}(x_k)\right\|^2}\right)^{-1}, \tag{15}$$

where $\operatorname{Var}$ is the sample variance used in (9).
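As a sketch, the heuristic (15) can be computed directly from quantities already available from the norm test (9):

```python
def initial_steplength(sample_var, batch_size, g_norm_sq):
    """Initial trial steplength heuristic, cf. (15).

    sample_var : sample variance of the FD gradients, as used in (9)
    batch_size : current |S_k|
    g_norm_sq  : ||g_{S_k}(x_k)||^2
    The step shrinks toward zero when the variance estimate dominates
    the squared gradient norm, guarding against bad stochastic steps.
    """
    return 1.0 / (1.0 + sample_var / (batch_size * g_norm_sq))
```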

Stable quasi-Newton update.

In BFGS and L-BFGS methods, the inverse Hessian approximation is updated by using the formula

$$H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}, \tag{16}$$

where $s_k = x_{k+1} - x_k$ and $y_k$ is the difference in the gradients at $x_{k+1}$ and $x_k$. In stochastic settings, several recent works [2, 6, 18] define $y_k$ as the difference in gradients measured on the same sample to ensure stability in the quasi-Newton approximation. We follow the same approach and define

$$y_k = g_{S_k}(x_{k+1}) - g_{S_k}(x_k). \tag{17}$$

However, even though computing gradient differences on common sample sets can improve stability, the curvature pair $(s_k, y_k)$ still may not satisfy the condition $y_k^T s_k > 0$ required to ensure positive definiteness of the quasi-Newton matrix. Therefore, in our tests we skip the quasi-Newton update if this curvature condition is not satisfied to within a small positive tolerance.
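A sketch of maintaining L-BFGS curvature pairs with the skipping safeguard; the tolerance `eps`, the scaling by $\|s\| \|y\|$, and the memory size are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def update_curvature_pairs(pairs, s, y, memory=10, eps=1e-8):
    """Maintain the L-BFGS history with the skipping safeguard.

    s : x_{k+1} - x_k
    y : g_{S_k}(x_{k+1}) - g_{S_k}(x_k), computed on the same sample
        set S_k, as in (17), for stability
    The pair is stored only if y^T s is sufficiently positive, which
    preserves positive definiteness of the quasi-Newton matrix.
    """
    if np.dot(y, s) > eps * np.linalg.norm(s) * np.linalg.norm(y):
        pairs.append((s, y))
        if len(pairs) > memory:
            pairs.pop(0)          # keep only the most recent pairs
    return pairs
```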

Convergence results.

We leave to future work the settings under which one can establish global convergence results to a neighborhood of an optimal solution for the proposed methods.

3 Numerical Experiments

As a demonstration, we conducted preliminary numerical experiments on stochastic nonlinear least-squares problems based on a mapping $\phi: \mathbb{R}^n \to \mathbb{R}^m$ affected by two forms of stochastic noise:

(18)

In both cases, the function $F(\cdot, \xi)$ and the expected function $f$ are twice continuously differentiable. Here we report results only for $\phi$ defined by the Chebyquad function from [8], for which an approximate noise-free optimal value $f^*$ is known.

We implemented two variants, “FD-Norm” and “FD-IPQN”, of the proposed algorithm using L-BFGS, with $|S_k|$ chosen based on either (9) or (13), respectively. We also implemented a finite-difference stochastic gradient method (“FD-SG”), $x_{k+1} = x_k - \alpha g_{S_k}(x_k)$, where $g_{S_k}$ is defined in (3) and the sample size is held fixed. We report results for the best version of FD-SG based on tuning the constant steplength $\alpha$ for each problem. The initial sample size, which differs between the larger-variance and smaller-variance noise settings, is the same across all the methods.

Figure 1 measures the error in the function value, $f(x_k) - f^*$, in terms of the total number of evaluations of $F$ (i.e., including those in the gradient estimates, curvature pair updates, and line search). The results show that both variants of our finite-difference quasi-Newton method are more efficient than the tuned finite-difference stochastic gradient method. Furthermore, the stochastic gradient method converged to a significantly larger neighborhood of the solution than the quasi-Newton variants did. No significant difference in performance was observed between the norm test and the inner product quasi-Newton test. These preliminary numerical results show that the modified tests have potential for stochastic problems where the CRN approach is feasible.

Figure 1: Chebyquad function results for a single run, shown for the two forms of stochastic noise in (18) (left and right columns) and the two noise levels considered (top and bottom rows).

Acknowledgments

This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, applied mathematics and SciDAC programs under Contract No. DE-AC02-06CH11357.

References

  • Audet and Hare [2017] C. Audet and W. L. Hare. Derivative-Free and Blackbox Optimization. Springer, 2017. doi: 10.1007/978-3-319-68913-5.
  • Berahas et al. [2016] A. S. Berahas, J. Nocedal, and M. Takáč. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063, 2016.
  • Bertsekas [2019] D. P. Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientific, 2019.
  • Blanchet et al. [2019] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg. Convergence rate analysis of a stochastic trust region method via submartingales. INFORMS Journal on Optimization, 2019. URL https://arxiv.org/abs/1609.07428. To appear.
  • Bollapragada et al. [2018a] R. Bollapragada, R. Byrd, and J. Nocedal. Adaptive sampling strategies for stochastic optimization. SIAM Journal on Optimization, 28(4):3312–3343, 2018a.
  • Bollapragada et al. [2018b] R. Bollapragada, J. Nocedal, D. Mudigere, H. Shi, and P. Tang. A progressive batching l-BFGS method for machine learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 620–629, 2018b.
  • Byrd et al. [2012] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.
  • Fletcher [1965] R. Fletcher. Function minimization without evaluating derivatives – a review. The Computer Journal, 8:33–41, 1965. doi: 10.1093/comjnl/8.1.33.
  • Fu et al. [2005] M. C. Fu, F. W. Glover, and J. April. Simulation optimization: A review, new developments, and applications. In Proceedings of the Winter Simulation Conference. IEEE, 2005. doi: 10.1109/wsc.2005.1574242.
  • Hashemi et al. [2014] F. S. Hashemi, S. Ghosh, and R. Pasupathy. On adaptive sampling rules for stochastic recursions. In Proceedings of the Winter Simulation Conference 2014, pages 3959–3970. IEEE, 2014.
  • Kim et al. [2015] S. Kim, R. Pasupathy, and S. G. Henderson. A guide to sample average approximation. In M. Fu, editor, Handbook of Simulation Optimization, volume 216 of International Series in Operations Research & Management Science, pages 207–243. Springer, 2015. doi: 10.1007/978-1-4939-1384-8_8.
  • Larson et al. [2019] J. Larson, M. Menickelly, and S. M. Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019. doi: 10.1017/s0962492919000060.
  • Mania et al. [2018] H. Mania, A. Guy, and B. Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 1800–1809. Curran Associates, Inc., 2018.
  • Moré and Wild [2012] J. J. Moré and S. M. Wild. Estimating derivatives of noisy simulations. ACM Transactions on Mathematical Software, 38(3):19:1–19:21, 2012. doi: 10.1145/2168773.2168777.
  • Pasupathy and Ghosh [2013] R. Pasupathy and S. Ghosh. Simulation optimization: A concise overview and implementation guide. In Theory Driven by Influential Applications, pages 122–150. INFORMS, 2013. doi: 10.1287/educ.2013.0118.
  • Pasupathy et al. [2018] R. Pasupathy, P. Glynn, S. Ghosh, and F. S. Hashemi. On sampling rates in simulation-based recursions. SIAM Journal on Optimization, 28(1):45–73, 2018.
  • Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. Technical Report 1703.03864, ArXiv, 2017. URL https://arxiv.org/abs/1703.03864.
  • Schraudolph et al. [2007] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 436–443, 2007.

Supplementary Material

Finite-difference approximation error.

For any $\xi$ and any function $F(\cdot, \xi)$ with $L$-Lipschitz continuous gradients, we have for each component $j \in \{1, \ldots, n\}$ that

$$\left|\left[g(x,\xi) - \nabla F(x,\xi)\right]_j\right| = \left|\frac{F(x + \nu e_j, \xi) - F(x, \xi)}{\nu} - \left[\nabla F(x,\xi)\right]_j\right| = \left|\frac{1}{\nu} \int_0^\nu \left(\left[\nabla F(x + t e_j, \xi)\right]_j - \left[\nabla F(x,\xi)\right]_j\right) dt\right| \le \frac{1}{\nu} \int_0^\nu L t \, dt = \frac{L \nu}{2},$$

where the first equality is by the definitions of $g(x,\xi)$ and $\nabla F(x,\xi)$, the second equality is due to the fundamental theorem of calculus, and the inequality is due to the $L$-Lipschitz continuity of $\nabla F(\cdot, \xi)$. Summing the squared bound over the $n$ components yields (6), that is, $\left\|g(x,\xi) - \nabla F(x,\xi)\right\| \le \frac{\sqrt{n} L \nu}{2}$.

In a similar manner, we can show that, for the true gradients, the error due to the finite-difference approximation is bounded. That is,

$$\left\|\nabla f_\nu(x) - \nabla f(x)\right\| \le \frac{\sqrt{n} L \nu}{2}. \tag{19}$$

Inner product quasi-Newton expected condition.

Consider the following:

$$\mathbb{E}_k\left[\left(H_k g_{S_k}(x_k)\right)^T H_k \nabla f(x_k)\right] = \nabla f_\nu(x_k)^T H_k^2 \nabla f(x_k),$$

where we used $\mathbb{E}_k\left[g_{S_k}(x_k)\right] = \nabla f_\nu(x_k)$, and we have

$$\nabla f_\nu(x_k)^T H_k^2 \nabla f(x_k) = \nabla f(x_k)^T H_k^2 \nabla f(x_k) + \left(\nabla f_\nu(x_k) - \nabla f(x_k)\right)^T H_k^2 \nabla f(x_k) \ge \lambda_n^2 \left\|\nabla f(x_k)\right\|^2 - \frac{\sqrt{n} L \nu}{2}\, \lambda_1^2 \left\|\nabla f(x_k)\right\|,$$

where the last inequality is due to (19). Therefore,

$$\mathbb{E}_k\left[\left(H_k g_{S_k}(x_k)\right)^T H_k \nabla f(x_k)\right] \ge \left\|\nabla f(x_k)\right\| \left(\lambda_n^2 \left\|\nabla f(x_k)\right\| - \frac{\sqrt{n} L \nu}{2}\, \lambda_1^2\right).$$

For any $x_k$ such that $\left\|\nabla f(x_k)\right\| > \frac{\sqrt{n} L \nu \lambda_1^2}{2 \lambda_n^2}$, where $\lambda_1, \lambda_n$ are the largest and smallest eigenvalues of $H_k$, respectively, it thus follows that the left-hand side of (10) is positive in expectation.

Numerical experiments setup.

In the tests of the proposed algorithm, we use fixed choices of the test parameter $\theta$, the finite-difference parameter $\nu$, the L-BFGS memory parameter, and the line search parameters $c_1$ and $\tau$. None of these parameters has been tuned to the problems being considered.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doe-public-access-plan.