1 Introduction
We consider unconstrained stochastic optimization problems of the form
(1) $\min_{x \in \mathbb{R}^n} F(x) = \mathbb{E}_{\xi}\left[ f(x; \xi) \right],$
where one has access only to a zeroth-order oracle (i.e., a black-box procedure that outputs realizations of the stochastic function values $f(x;\xi)$ and cannot access explicit estimates of the gradient $\nabla F$). Such stochastic optimization problems arise in a plethora of science and engineering applications, from simulation optimization [4, 9, 11, 15] to reinforcement learning [3, 13, 17]. Several methods have been proposed to solve such "derivative-free" problems [1, 12].
We propose finite-difference stochastic quasi-Newton methods for solving (1) by exploiting common random number (CRN) evaluations of $f$. The CRN setting allows us to define subsampled gradient estimators
(2) $\left[ \nabla_{\nu} f(x; \xi_i) \right]_j = \frac{f(x + \nu e_j; \xi_i) - f(x; \xi_i)}{\nu}, \qquad j = 1, \ldots, n,$
(3) $g_{S_k}(x) = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla_{\nu} f(x; \xi_i),$
which employ forward differences for the i.i.d. samples $\xi_i$ of $\xi$ in the set $S_k$ and whereby the parameter $\nu$ needs to account only for numerical errors. This gradient estimation has two sources of error: error due to the finite-difference approximation and error due to the stochastic approximation. The error due to stochastic approximation depends on the number of samples used in the estimation: using too few samples affects the stability of a method using the estimates, while using a large number of samples results in computational inefficiency. For settings where gradient information is available, researchers have developed practical tests to adaptively increase the sample sizes used in the stochastic approximations and have supported these tests with global convergence results [5, 6, 7, 16]. In this paper we modify these tests to address the challenges associated with the finite-difference approximation errors, and we demonstrate the resulting method on simple test problems.
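As an illustration, the CRN estimators in (2)-(3) can be sketched as follows; the oracle signature `f(x, seed)` and the helper name `fd_gradient` are assumptions of this sketch, not part of the method's specification.

```python
import numpy as np

def fd_gradient(f, x, seeds, nu=1e-6):
    """Forward-difference subsampled gradient estimate under common random
    numbers (CRN): for each sample, the function is evaluated at x and at the
    n perturbed points with the SAME seed, so the stochastic noise largely
    cancels in the difference and only numerical error remains."""
    n = x.size
    grads = np.empty((len(seeds), n))
    for i, seed in enumerate(seeds):
        f0 = f(x, seed)                              # base value, sample i
        for j in range(n):
            xp = x.copy()
            xp[j] += nu
            grads[i, j] = (f(xp, seed) - f0) / nu    # forward difference (2)
    return grads.mean(axis=0), grads                 # batch estimate (3), per-sample estimates
```

With additive noise that depends only on the seed, CRN cancels the noise exactly and the estimate inherits only the O(nu) finite-difference error.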
2 A Derivative-Free Stochastic Quasi-Newton Algorithm
The update form of a finite-difference, derivative-free stochastic quasi-Newton method is given by
(4) $x_{k+1} = x_k - \alpha_k H_k g_{S_k}(x_k),$
where $\alpha_k$ is the steplength, $H_k$ is a positive-definite quasi-Newton matrix, and $g_{S_k}$ is the (e.g., forward) finite-difference subsampled (or batch) gradient estimate defined by (3). We propose to control the sample sizes $|S_k|$ over the course of optimization to achieve fast convergence by using two different strategies adapted to the setting where no gradient information is available.
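The update itself is straightforward; the following toy loop illustrates it on a deterministic quadratic. The matrix `A`, the use of the exact inverse Hessian as a stand-in for the quasi-Newton matrix, and the fixed steplength are illustrative assumptions; in the stochastic method the gradient would be the batch finite-difference estimate from (3).

```python
import numpy as np

# Illustration of update (4) on F(x) = 0.5 x^T A x with a fixed steplength.
A = np.diag([1.0, 10.0])
H = np.linalg.inv(A)           # stand-in for the quasi-Newton matrix H_k
x = np.array([5.0, -3.0])
for k in range(20):
    g = A @ x                  # true gradient (batch FD estimate in practice)
    x = x - 1.0 * (H @ g)      # update (4) with alpha_k = 1
```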
Norm test.
The first test is based on the norm condition [7, 10]:
(5) $\mathbb{E}\left[ \left\| g_{S_k}(x_k) - \nabla F(x_k) \right\|^2 \right] \le \theta^2 \left\| \nabla F(x_k) \right\|^2,$
which is used for controlling the sample sizes in subsampled gradient methods. We note that the finite-difference approximation error (the squared bias term in the mean-squared error) in $g_{S_k}$ is nonzero and can be upper bounded (independently of $|S_k|$) for any function with $L$-Lipschitz continuous gradients:
(6) $\left\| \mathbb{E}\left[ g_{S_k}(x_k) \right] - \nabla F(x_k) \right\| \le \frac{\sqrt{n} L \nu}{2};$
the proof is given in the supplementary material. Therefore, one cannot always satisfy (5); moreover, satisfying (5) might be too restrictive. Instead, we propose to look at the norm condition based on the finite-difference subsampled gradient estimators. That is, we use the condition
(7) $\mathbb{E}\left[ \left\| g_{S_k}(x_k) - \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2 \right] \le \theta^2 \left\| \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2,$
for which $\mathbb{E}\left[ g_{S_k}(x_k) \right] = \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right]$ and which corresponds to a norm condition in which the right-hand side of (5) is relaxed. The left-hand side of (7) is difficult to compute but can be bounded by the true variance of the individual finite-difference gradient estimators; this results in
(8) $\frac{\mathbb{E}\left[ \left\| \nabla_{\nu} f(x_k; \xi) - \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2 \right]}{|S_k|} \le \theta^2 \left\| \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2.$
Approximating the true expected gradient and variance with sample gradient and variance estimates, respectively, yields the practical finite-difference norm test:
(9) $\frac{\mathrm{Var}_{i \in T_k}\left( \nabla_{\nu} f(x_k; \xi_i) \right)}{|S_k|} \le \theta^2 \left\| g_{S_k}(x_k) \right\|^2,$
where $T_k \subseteq S_k$ is a subset of the current sample (batch) and $\mathrm{Var}$ denotes the sample variance. In our algorithm, we test condition (9); whenever it is not satisfied, we increase $|S_k|$ such that the condition is satisfied.
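A sketch of how the practical norm test can drive the sample-size control is given below; the function name and the rule for suggesting the new batch size are illustrative assumptions (the scaling rule mirrors common practice in adaptive-sampling methods).

```python
import numpy as np

def norm_test_sample_size(per_sample_grads, batch_size, theta=0.9):
    """Practical finite-difference norm test (9): compare the sample variance
    of the per-sample FD gradient estimators (divided by |S_k|) against
    theta^2 times the squared norm of the batch estimate. If the test fails,
    suggest the larger sample size obtained by scaling |S_k| up by the ratio
    of the two sides."""
    g = per_sample_grads.mean(axis=0)
    # sample variance of the individual FD gradient estimators (trace form)
    var = per_sample_grads.var(axis=0, ddof=1).sum()
    lhs = var / batch_size
    rhs = theta**2 * np.dot(g, g)
    if lhs <= rhs:
        return batch_size                         # test passed; keep |S_k|
    return int(np.ceil(lhs / rhs * batch_size))   # suggested new |S_k|
```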
Inner product quasi-Newton test.
The norm condition controls the variance in the gradient estimation but does not utilize observed quasi-Newton information to control the sample sizes. Recently, Bollapragada et al. [6] proposed to control the sample sizes used in the gradient estimation by ensuring that the stochastic quasi-Newton directions make an acute angle with the true quasi-Newton direction with high probability. That is,
(10) $\left( H_k g_{S_k}(x_k) \right)^T H_k \nabla F(x_k) > 0$
holds with high probability. This condition can be satisfied in expectation at points that are sufficiently far away from the stationary points; that is, for points such that $\left\| \nabla F(x_k) \right\| > \frac{\lambda_1^2}{\lambda_n^2} \frac{\sqrt{n} L \nu}{2}$, where $\lambda_1$ and $\lambda_n$ are the largest and smallest eigenvalues of $H_k$, respectively (see supplementary material). Hence, condition (10) can be satisfied with high probability at points that are sufficiently far away from being stationary points. We must control the variance in the left-hand side of (10) to achieve this objective. We note that the quantity $\left\| H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2$ cannot be computed directly; however, it can be approximated by $\left\| H_k g_{S_k}(x_k) \right\|^2$. The condition is given as
(11) $\mathbb{E}\left[ \left( \left( H_k g_{S_k}(x_k) \right)^T H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] - \left\| H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2 \right)^2 \right] \le \theta^2 \left\| H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^4,$
where $\theta > 0$. The left-hand side of (11) can be bounded by the true variance, as done before. Therefore, the following condition is sufficient for ensuring that (11) holds:
(12) $\frac{\mathbb{E}\left[ \left( \left( H_k \nabla_{\nu} f(x_k; \xi) \right)^T H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] - \left\| H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^2 \right)^2 \right]}{|S_k|} \le \theta^2 \left\| H_k \mathbb{E}\left[ \nabla_{\nu} f(x_k; \xi) \right] \right\|^4.$
Approximating the true expected gradient and variance with sample gradient and variance estimates results in the practical finite-difference inner product quasi-Newton test:
(13) $\frac{\mathrm{Var}_{i \in T_k}\left( \left( H_k \nabla_{\nu} f(x_k; \xi_i) \right)^T H_k g_{S_k}(x_k) \right)}{|S_k|} \le \theta^2 \left\| H_k g_{S_k}(x_k) \right\|^4,$
where $T_k \subseteq S_k$ is a subset of the current sample (batch). This variance computation requires only one additional Hessian-vector product (i.e., the product of $H_k$ with $H_k g_{S_k}(x_k)$). In our algorithm, we test condition (13); whenever it is not satisfied, we increase $|S_k|$ to satisfy the condition.
Finite-difference parameter and steplength selection.
We select the finite-difference parameter $\nu$ by minimizing an upper bound on the error in the gradient approximation. Assuming that the numerical errors in computing $f$ are uniformly bounded by $\epsilon_f$ yields the parameter value $\nu = 8^{1/4} \sqrt{\epsilon_f / L}$. This finite-difference parameter is analogous to the one derived in [14], which depends on the variance in the stochastic noise; however, in the CRN setting we need to account only for numerical errors.
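A minimal sketch of this selection rule, assuming the forward-difference error model of [14] with a uniform noise bound `eps_f` and gradient Lipschitz constant `L`; the constant $8^{1/4}$ comes from that error model and is an assumption of this sketch.

```python
import numpy as np

def fd_parameter(eps_f, L):
    """Forward-difference parameter minimizing an upper bound on the
    gradient-approximation error, in the spirit of More and Wild [14].
    In the CRN setting only the uniform numerical-error bound eps_f
    (not the stochastic variance) enters the formula."""
    return 8.0**0.25 * np.sqrt(eps_f / L)
```

For double-precision noise levels (eps_f near 1e-16) and moderate L, this gives a parameter on the order of 1e-8, which matches the usual square-root-of-machine-epsilon heuristic up to a constant.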
We employ a stochastic line search to choose the steplength $\alpha_k$ by using a sufficient-decrease condition based on the sampled function values. In particular, we would like to satisfy
(14) $f_{S_k}(x_k + \alpha_k d_k) \le f_{S_k}(x_k) + c_1 \alpha_k g_{S_k}(x_k)^T d_k,$
where $d_k = -H_k g_{S_k}(x_k)$, $f_{S_k}(x) = \frac{1}{|S_k|} \sum_{i \in S_k} f(x; \xi_i)$, and $c_1 \in (0,1)$ is a user-specified parameter. We employ a backtracking procedure wherein a trial $\alpha_k$ that does not satisfy (14) is reduced by a fixed fraction (i.e., $\alpha_k \leftarrow \tau \alpha_k$ with $\tau \in (0,1)$). One cannot always satisfy (14), since the quasi-Newton direction may not be a descent direction for the sampled function at $x_k$. Intuitively, at points where the error in the sample gradient estimation dominates the sample gradient itself, measured in the matrix norm induced by $H_k$, this condition may not be satisfied, and the line search fails. In practice, a stochastic line search failure is an indication that the method has converged to a neighborhood of a solution of the sampled function $f_{S_k}$ and that the solution cannot be further improved. Therefore, one can use line search failure as an early stopping criterion.
We also note that, because of the stochasticity in the function values, a decrease in the stochastic function realizations does not guarantee a decrease in the true function. A conservative strategy to address this issue is to choose the initial trial steplength small enough that any increase in the function values (when the stochastic approximations are poor) is controlled. Following the strategy proposed in [6], we derive a heuristic to choose the initial steplength as
(15) 
where $\mathrm{Var}_{i \in T_k}$ is the sample variance used in (9).
Stable quasi-Newton update.
In the BFGS and L-BFGS methods, the inverse Hessian approximation is updated by using the formula
(16) $H_{k+1} = \left( I - \rho_k s_k y_k^T \right) H_k \left( I - \rho_k y_k s_k^T \right) + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k},$
where $s_k = x_{k+1} - x_k$ and $y_k$ is the difference in the gradients at $x_{k+1}$ and $x_k$. In stochastic settings, several recent works [2, 6, 18] define $y_k$ as the difference in gradients measured on the same sample to ensure stability in the quasi-Newton approximation. We follow the same approach and define
(17) $y_k = g_{S_k}(x_{k+1}) - g_{S_k}(x_k).$
However, even though computing gradient differences on common sample sets can improve stability, the curvature pair $(s_k, y_k)$ still may not satisfy the condition required to ensure positive definiteness of the quasi-Newton matrix. Therefore, in our tests we skip the quasi-Newton update whenever the condition $y_k^T s_k \ge \epsilon \left\| s_k \right\|^2$ is not satisfied, for a small constant $\epsilon > 0$.
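The skipping rule can be sketched as follows; the helper name and the particular threshold value are illustrative assumptions.

```python
import numpy as np

def maybe_store_pair(s, y, pairs, eps=1e-8):
    """Store the curvature pair (s_k, y_k) for the L-BFGS update only if
    y_k^T s_k >= eps * ||s_k||^2; otherwise skip the update so that the
    quasi-Newton matrix stays positive definite. The value of eps here is
    illustrative, not the one used in the reported experiments."""
    if y @ s >= eps * (s @ s):
        pairs.append((s, y))
        return True
    return False                  # update skipped
```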
Convergence results.
We leave to future work the settings under which one can establish global convergence results to a neighborhood of an optimal solution for the proposed methods.
3 Numerical Experiments
As a demonstration, we conducted preliminary numerical experiments on stochastic nonlinear least-squares problems based on a mapping $\phi$ affected by two forms of stochastic noise:
(18) 
where and . In both cases, the function $f$ and the expected function $F$ are twice continuously differentiable. Here we report results only for $\phi$ defined by the Chebyquad function from [8] with , , and an approximate noise-free value .
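Illustrative stand-ins for additive and multiplicative noise models of this kind are sketched below; the specific mapping `phi`, the noise scale, and the dimensions are placeholders, not those of the reported experiments.

```python
import numpy as np

def f_additive(phi, x, rng, sigma=1e-3):
    """Nonlinear least-squares value ||phi(x)||^2 with additive Gaussian noise."""
    r = phi(x)
    return float(r @ r) + sigma * rng.standard_normal()

def f_multiplicative(phi, x, rng, sigma=1e-3):
    """||phi(x) * (1 + xi)||^2 with componentwise multiplicative Gaussian noise."""
    r = phi(x)
    return float(r @ (r * (1.0 + sigma * rng.standard_normal(r.size))**2))
```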
We implemented two variants, "FD-Norm" and "FD-IPQN", of the proposed algorithm using L-BFGS, with $|S_k|$ chosen based on either (9) or (13), respectively. We also implemented a finite-difference stochastic gradient method ("FD-SG"), where the gradient estimate is defined by (3) and the sample size is fixed. We report results for the best version of FD-SG based on tuning the constant steplength for each problem (e.g., considering , for ). We use the same initial sample size for the larger-variance () problems and for the smaller-variance () problems in all the methods.
Figure 1 measures the error in the function, $F(x_k) - F^*$, in terms of the total number of evaluations of $f$ (i.e., including those in the gradient estimates, curvature-pair updates, and line search). The results show that both variants of our finite-difference quasi-Newton method are more efficient than the tuned finite-difference stochastic gradient method. Furthermore, the stochastic gradient method converged to a significantly larger neighborhood of the solution than the quasi-Newton variants did. No significant difference in performance was observed between the norm test and the inner product quasi-Newton test. These preliminary numerical results show that the modified tests have potential for stochastic problems where the CRN approach is feasible.
Acknowledgments
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, applied mathematics and SciDAC programs under Contract No. DE-AC02-06CH11357.
References
 Audet and Hare [2017] C. Audet and W. L. Hare. Derivative-Free and Blackbox Optimization. Springer, 2017. doi: 10.1007/978-3-319-68913-5.
 Berahas et al. [2016] A. S. Berahas, J. Nocedal, and M. Takáč. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pages 1055–1063, 2016.
 Bertsekas [2019] D. P. Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientific, 2019.
 Blanchet et al. [2019] J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg. Convergence rate analysis of a stochastic trust region method via submartingales. INFORMS Journal on Optimization, 2019. URL https://arxiv.org/abs/1609.07428. To appear.
 Bollapragada et al. [2018a] R. Bollapragada, R. Byrd, and J. Nocedal. Adaptive sampling strategies for stochastic optimization. SIAM Journal on Optimization, 28(4):3312–3343, 2018a.
 Bollapragada et al. [2018b] R. Bollapragada, J. Nocedal, D. Mudigere, H. Shi, and P. Tang. A progressive batching L-BFGS method for machine learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 620–629, 2018b.
 Byrd et al. [2012] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.
 Fletcher [1965] R. Fletcher. Function minimization without evaluating derivatives – a review. The Computer Journal, 8:33–41, 1965. doi: 10.1093/comjnl/8.1.33.
 Fu et al. [2005] M. C. Fu, F. W. Glover, and J. April. Simulation optimization: A review, new developments, and applications. In Proceedings of the Winter Simulation Conference. IEEE, 2005. doi: 10.1109/wsc.2005.1574242.
 Hashemi et al. [2014] F. S. Hashemi, S. Ghosh, and R. Pasupathy. On adaptive sampling rules for stochastic recursions. In Proceedings of the Winter Simulation Conference 2014, pages 3959–3970. IEEE, 2014.
 Kim et al. [2015] S. Kim, R. Pasupathy, and S. G. Henderson. A guide to sample average approximation. In M. Fu, editor, Handbook of Simulation Optimization, volume 216 of International Series in Operations Research & Management Science, pages 207–243. Springer, 2015. doi: 10.1007/978-1-4939-1384-8_8.
 Larson et al. [2019] J. Larson, M. Menickelly, and S. M. Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019. doi: 10.1017/s0962492919000060.
 Mania et al. [2018] H. Mania, A. Guy, and B. Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 1800–1809. Curran Associates, Inc., 2018.
 Moré and Wild [2012] J. J. Moré and S. M. Wild. Estimating derivatives of noisy simulations. ACM Transactions on Mathematical Software, 38(3):19:1–19:21, 2012. doi: 10.1145/2168773.2168777.
 Pasupathy and Ghosh [2013] R. Pasupathy and S. Ghosh. Simulation optimization: A concise overview and implementation guide. In Theory Driven by Influential Applications, pages 122–150. INFORMS, 2013. doi: 10.1287/educ.2013.0118.
 Pasupathy et al. [2018] R. Pasupathy, P. Glynn, S. Ghosh, and F. S. Hashemi. On sampling rates in simulationbased recursions. SIAM Journal on Optimization, 28(1):45–73, 2018.
 Salimans et al. [2017] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. Technical Report 1703.03864, ArXiv, 2017. URL https://arxiv.org/abs/1703.03864.

 Schraudolph et al. [2007] N. N. Schraudolph, J. Yu, and S. Günter. A stochastic quasi-Newton method for online convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 436–443, 2007.
Supplementary Material
Finite-difference approximation error.
For any and any function with Lipschitz continuous gradients, we have that
where the first equality is by the definitions of and . The first inequality is due to the fact that , and the second inequality is due to
In a similar manner, we can show that, for the true gradients, the error due to the finite-difference approximation is bounded. That is,
(19) $\left\| \mathbb{E}\left[ g_{S_k}(x) \right] - \nabla F(x) \right\| \le \frac{\sqrt{n} L \nu}{2}.$
Inner product quasi-Newton expected condition.
Consider the following:
where we used
and we have,
where the last inequality is due to (19). Therefore,
For any $x$ such that $\left\| \nabla F(x) \right\| > \frac{\lambda_1^2}{\lambda_n^2} \frac{\sqrt{n} L \nu}{2}$, where $\lambda_1$ and $\lambda_n$ are the largest and smallest eigenvalues of $H_k$, respectively, it thus follows that
Numerical experiments setup.
In the tests of the proposed algorithm, we use fixed values of $\theta$, the finite-difference parameter $\nu$, the L-BFGS memory parameter, and the line-search parameters ($c_1$, $\tau$). None of these parameters have been tuned to the problems being considered.
The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DEAC0206CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paidup nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. http://energy.gov/downloads/doepublicaccessplan.