1 Introduction
Many applications in robotics (schaal2010learning), manufacturing (maier2018turning), health sciences, finance, and other fields require minimizing a loss function under constraints and uncertainty. Optimizing a loss function under partially revealed constraints is further complicated by the fact that observations are available only inside the feasible set. Hence, one needs to choose actions carefully, ensuring the feasibility of each iterate while pursuing the optimal solution. In the machine learning community, this problem is known as
safe learning. For such tasks, feasible optimization methods are required. There are many first and second order feasible methods in the literature. Although Hessians are hard to estimate accurately given a noisy zeroth order oracle, derivatives can be approximated using finite differences. The most widely used first order methods for stochastic optimization handle constraints using projections. However, the lack of global knowledge of the constraint functions makes it impossible to compute the corresponding projection operator.
Related work.
There is a lack of zeroth order feasible (safe) algorithms for black-box constrained optimization in the literature. balasubramanian2018zeroth provide a comprehensive analysis of the performance of several zeroth order algorithms for nonconvex optimization. However, the conditional-gradient (Frank-Wolfe) based algorithm of balasubramanian2018zeroth for constrained nonconvex problems requires global knowledge of the constraint functions, since the linear objective must be optimized with respect to these constraints. usmanova2019safe introduce a Frank-Wolfe based algorithm for the case of a noisy zeroth order oracle with linear constraints. This work proves feasibility of the iterates with high probability and bounds the convergence rate with high probability, but it requires convexity.
Nonconvex nonsmooth problems can be addressed by feasible methods such as the first order Method of Feasible Directions (topkis1967convergence) or the second order Feasible Sequential Quadratic Programming (FSQP) algorithm (tang2014feasible). Another algorithm for nonconvex nonsmooth problems is given by facchinei2017feasible. The idea of this algorithm is to use, as the direction of movement, the minimizer of a local convex approximation of the objective subject to a local convex approximation of the constraint set. Unfortunately, all the guarantees for the above methods are only asymptotic convergence to a stationary point.
Another class of safe algorithms for global black-box optimization is based on Bayesian Optimization (BO), such as SafeOpt (sui2015safe) and its extensions (berkenkamp2016bayesian). The main drawback of these methods is that their computational complexity grows exponentially with the dimensionality.
Interior Point Methods (IPM) are feasible approaches by definition, and they are widely used for Linear Programming, Quadratic Programming, and Conic optimization problems. By using self-concordance properties of specifically chosen barriers together with second order information, these problems can be solved extremely efficiently by IPM. However, when the constraints are unknown, building a barrier with self-concordance properties is not possible. In these cases one can instead use logarithmic barriers for general black-box constraints.
hinder2018one propose to choose adaptive step sizes for the gradient algorithm on the log barrier and analyse its convergence rate. However, hinder2018one assume knowledge of the exact gradients of the cost and constraint functions. In the present work, we extend this approach to the case in which we only have access to a possibly noisy zeroth order oracle.
Our Contributions.
In this paper we propose the first safe zeroth order algorithm for nonconvex optimization with a black-box noisy oracle. We prove that it generates feasible iterates with high probability, and we analyse its convergence rate to a local optimum. Each iteration is computationally very cheap and does not require solving any subproblems such as those required by Frank-Wolfe or Bayesian Optimization based algorithms. In Table 1, we compare our algorithm with the existing ones for unconstrained and constrained zeroth order nonconvex optimization. The first two algorithms assume multi-point feedback, i.e., it is possible to measure several points under the same noise realization. The convergence rate in the second column is proven for known polyhedral constraints. Our algorithm assumes the more realistic and more difficult setup in which the noise changes with each measurement. There are some works on zeroth order 1-point feedback for convex optimization with known constraints; see, for example, bach2016highly, which quantifies the number of measurements required to achieve a given accuracy.
Problem | Unconstrained | Known constraints | Safe despite unknown constraints
Feedback | Noisy 2-point | Noisy 2-point | Noisy 1-point
Optimality criterion | stationary point | stationary point | approximate scaled KKT point
Number of measurements | (no matrix inversion; Hessian estimation) (Balasubramanian, Ghadimi, 2018) | (conditional gradient based; requires optimization subproblems w.r.t. the constraints) (Balasubramanian, Ghadimi, 2018) | (this paper; does not require solving subproblems or matrix inversions)

2 Problem Statement
Notations and definitions.
Let $\|\cdot\|_1$, $\|\cdot\|_2$ and $\|\cdot\|_\infty$ denote the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms on $\mathbb{R}^d$, respectively. A function $f$ is called $L$-Lipschitz continuous if
(1) $|f(x) - f(y)| \le L \|x - y\|_2$ for all $x, y$.
It is called $M$-smooth if its gradients are Lipschitz continuous, i.e.,
(2) $\|\nabla f(x) - \nabla f(y)\|_2 \le M \|x - y\|_2$ for all $x, y$,
which implies that $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{M}{2}\|y - x\|_2^2$.
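The quadratic upper bound implied by smoothness can be checked numerically. The sketch below is our illustration; the quadratic test function and constants are assumptions, not part of the paper.

```python
import numpy as np

# Numerical sanity check of the descent lemma implied by M-smoothness:
#   f(y) <= f(x) + <grad f(x), y - x> + (M/2) * ||y - x||^2.
# Illustrative instance: f(x) = 0.5 * x^T A x has an M-Lipschitz
# gradient with M = ||A||_2 (the spectral norm of A).

rng = np.random.default_rng(0)
A = np.diag([1.0, 3.0])              # positive definite, so M = 3.0
M = np.linalg.norm(A, 2)

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(2)
y = rng.standard_normal(2)
upper = f(x) + grad(x) @ (y - x) + 0.5 * M * np.linalg.norm(y - x) ** 2
assert f(y) <= upper + 1e-12         # the quadratic upper bound holds
```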
Problem formulation
We consider the problem of nonconvex safe learning defined as a constrained optimization problem
(3) $\min_{x \in \mathbb{R}^d} f^0(x)$ subject to $f^i(x) \le 0,\ i = 1, \dots, m$,
where the objective function $f^0$ and the constraints $f^i$ are unknown continuous functions, and can only be accessed at feasible points $x$. We denote by $\mathcal{X} = \{x \in \mathbb{R}^d : f^i(x) \le 0,\ i = 1, \dots, m\}$ the feasible set.
Assumption 1.
The objective $f^0$ and the constraint functions $f^i$ for $i = 1, \dots, m$ are $M_i$-smooth on $\mathcal{X}$. Also, the constraint functions $f^i$ for $i = 1, \dots, m$ are $L_i$-Lipschitz continuous on $\mathcal{X}$.
Assumption 2.
The feasible set $\mathcal{X}$ has a nonempty interior, and there exists a known starting point $x^0$ for which $f^i(x^0) < 0$ for $i = 1, \dots, m$.
Oracle information.
We consider zeroth order oracle information. If we can measure the function values exactly, then one possible oracle is the Exact Zeroth Order oracle (EZO).
EZO: provides the exact values $f^i(x)$, $i = 0, \dots, m$, for any requested point $x \in \mathcal{X}$.
In many applications, the measurements of the functions are noisy. We assume that the additive noise comes from a zero-mean sub-Gaussian distribution. We call this oracle the Stochastic Zeroth Order oracle (SZO).
SZO: provides the noisy function values $f^i(x) + \xi^i$, where $\xi^i$ is zero-mean sub-Gaussian noise independent of previous measurements, for any requested point $x \in \mathcal{X}$. We assume that the noise values are independent over time and over the indices $i$.
Optimality criterion.
The condition $\|\nabla f(x)\|_2 \le \varepsilon$ is usually used as an optimality criterion in nonconvex smooth optimization without constraints. It is well known that in the unconstrained case the classical gradient descent method converges at a rate matching the lower bound derived for this class of problems (carmon2017lower). In nonconvex constrained optimization, the first order criteria are the Karush-Kuhn-Tucker (KKT) conditions, which are necessary in the presence of regularity conditions called Constraint Qualification (CQ). In such cases, we can measure the solution accuracy by the satisfaction of approximate KKT conditions. The point $x$ is called a KKT point if it satisfies the necessary condition for local optimality:
where $\lambda \in \mathbb{R}^m_{+}$ is the vector of dual variables and $\mathcal{L}(x, \lambda) = f^0(x) + \sum_{i=1}^m \lambda^i f^i(x)$ is the Lagrangian. There are several ways to define an approximate KKT point; see (cartis2014complexity; birgin2016evaluation). Similar to cartis2014complexity, we define an approximate scaled KKT point (sKKT) for some as a pair which satisfies:
(sKKT.1)  
(sKKT.2)  
(sKKT.3) 
Note that the approximation lies in substituting the equality constraints of the standard KKT conditions with the inequalities (sKKT.2) and (sKKT.3). In case $\|\lambda\|$ can be uniformly bounded by some constant, an approximate scaled KKT point is also an unscaled approximate KKT point.
3 Preliminaries
3.1 Log barrier algorithm.
We address the safe learning problem using the log barrier approach. The log barrier function and its gradient are defined as follows:
(4) $B_\eta(x) = f^0(x) - \eta \sum_{i=1}^m \log(-f^i(x))$,
(5) $\nabla B_\eta(x) = \nabla f^0(x) + \eta \sum_{i=1}^m \frac{\nabla f^i(x)}{-f^i(x)}$.
The main idea of log barrier methods is to solve a sequence of barrier subproblems
(6) $\min_{x \in \operatorname{int} \mathcal{X}} B_\eta(x)$
with decreasing $\eta$. For the rest of the paper we fix $\eta$ to be constant, since we propose an algorithm to solve the subproblems, which is challenging given only noisy function evaluations. Let us set the pair of primal and dual variables to be $(x, \lambda(x))$, where $\lambda^i(x) = \frac{\eta}{-f^i(x)}$ for $i = 1, \dots, m$. Then, one can verify the following properties:
Now recall (sKKT.1)-(sKKT.3) and suppose that $x$ is such that $\|\nabla B_\eta(x)\|_2 \le \varepsilon$. Then, one can verify that the pair $(x, \lambda(x))$ is an approximate scaled KKT point.
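The log barrier and the induced dual candidates can be sketched in a few lines. This is our illustration; the function handles and the toy instance below are assumptions, not the paper's experiment.

```python
import numpy as np

# Minimal sketch of the log barrier for
#   min f0(x)  s.t.  fi(x) <= 0, i = 1..m,
# with B_eta(x) = f0(x) - eta * sum_i log(-fi(x)), its gradient, and the
# dual candidates lambda_i(x) = eta / (-fi(x)) used in the sKKT pair.

def log_barrier(x, f0, grad_f0, fis, grad_fis, eta):
    vals = np.array([fi(x) for fi in fis])
    assert np.all(vals < 0), "x must be strictly feasible"
    B = f0(x) - eta * np.sum(np.log(-vals))
    gB = grad_f0(x) + eta * sum(g(x) / (-v) for g, v in zip(grad_fis, vals))
    lam = eta / (-vals)              # dual estimates
    return B, gB, lam

# Toy instance: minimize x0 + x1 inside the unit ball.
f0 = lambda x: x[0] + x[1]
g0 = lambda x: np.ones(2)
f1 = lambda x: np.dot(x, x) - 1.0
g1 = lambda x: 2.0 * x

B, gB, lam = log_barrier(np.zeros(2), f0, g0, [f1], [g1], eta=0.1)
# At the center, the constraint gradient vanishes, so gB = grad f0 = (1, 1).
```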
First order log barrier algorithm (hinder2019poly).
For structured problems like conic optimization with known self-concordant barriers and computable Hessians, barrier methods classically use the Newton algorithm to solve the log barrier subproblems (6). However, since we assume that the structure is unknown and second order information is inaccessible, we would like to use gradient descent type methods with the step direction $-\nabla B_\eta(x_k)$ to find a solution of the barrier subproblem (6). The main drawback of solving and analysing the log barrier subproblems using gradient methods is that the log barriers themselves are non-Lipschitz and nonsmooth functions, since their gradients grow to infinity at the boundary. This can lead to unstable behaviour of gradient based algorithms close to the boundary, forcing exponentially small step sizes. To handle this drawback, hinder2019poly proposed to choose an adaptive step size $\gamma_k$, whose normalization represents a local Lipschitz constant of $\nabla B_\eta$ at the point $x_k$. The convergence rate for finding a solution of the subproblem, i.e., convergence to an approximate KKT point, using their algorithm with adaptive step sizes is stated in Theorem 1.
Theorem 1.
(Claim 2. (hinder2019poly)) Under Assumption 1 for any constant , after at most iterations such that
the procedure finds a point such that
hinder2019poly were the first to derive such rates for first order log barrier methods for general Lipschitz, smooth functions, thanks to their adaptive step size. This choice allowed them to define and use a local Lipschitz constant of the barrier gradients in their analysis.
Idea of our algorithm.
We extend the first order algorithm of hinder2019poly to the zeroth order oracle case. To this end, we propose to estimate the gradients from zeroth order information using finite differences. Based on these estimates we can estimate the barrier gradients. At the same time, we ensure that the measurements are taken in the safe region despite lack of knowledge of the constraint functions.
3.2 Zeroth order gradient estimation.
We now construct estimators $G^i(x)$ and $\hat G^i(x)$ of the gradient $\nabla f^i(x)$ for each $i = 0, \dots, m$, based on the zeroth order information provided by EZO and SZO, respectively. We denote the differences between the estimators and the function gradient by $\delta^i(x)$ and $\hat\delta^i(x)$:
(7) $\delta^i(x) = G^i(x) - \nabla f^i(x)$,
(8) $\hat\delta^i(x) = \hat G^i(x) - \nabla f^i(x)$.
EZO estimator of the gradient.
For the exact oracle, we take measurements at $x$ and at $x + \nu e_j$, $j = 1, \dots, d$, around the current point $x$ to estimate $\nabla f^i(x)$. Let $e_j$ be the $j$-th coordinate vector. For $\nu > 0$ we can use the following estimator of $\nabla f^i(x)$ (another way is to measure at $x + \nu s$ and $x - \nu s$, where $s$ is a random direction, and use a stochastic zeroth order estimator; in that case the dependence on the dimensionality would be smaller, but the dependence on the confidence level would be much worse than for our algorithm, since high-probability bounds can then be obtained only via multi-starts using the Markov inequality (cf. ghadimi2013stochastic; yu2019zeroth)):
(9) $G^i(x) = \sum_{j=1}^d \frac{f^i(x + \nu e_j) - f^i(x)}{\nu}\, e_j$.
Using (2), we can upper bound the deviation of this estimator from $\nabla f^i(x)$ as follows (see Appendix I for the proof):
(10) $\|G^i(x) - \nabla f^i(x)\|_2 \le \frac{M_i \sqrt{d}\, \nu}{2}$.
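The coordinate-wise finite-difference estimator and its smoothness-driven error bound can be verified numerically. A minimal sketch, assuming an illustrative test function with known smoothness constant:

```python
import numpy as np

# Sketch of the exact-oracle (EZO) coordinate-wise finite-difference
# estimator with entries (f(x + nu*e_j) - f(x)) / nu; for an M-smooth f
# its deviation from grad f(x) is at most M * sqrt(d) * nu / 2.

def fd_gradient(f, x, nu):
    d = x.size
    fx = f(x)
    G = np.empty(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = nu
        G[j] = (f(x + e) - fx) / nu
    return G

M, nu = 1.0, 1e-3
f = lambda x: 0.5 * np.dot(x, x)     # M-smooth with M = 1, gradient x
x = np.array([1.0, -2.0, 0.5])
G = fd_gradient(f, x, nu)
err = np.linalg.norm(G - x)          # true gradient is x itself
assert err <= M * np.sqrt(x.size) * nu / 2 + 1e-9
```

For this quadratic the bound is tight: each coordinate of the estimate is off by exactly nu/2.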
SZO estimator of the gradient.
For the stochastic oracle, we take $n$ measurements around the current point $x$. The number of measurements needs to be chosen depending on $\nu$, since the influence of the noise variance increases as the measurement points get closer to each other. We define the vector of noises accordingly. Then, the estimator is given by
(11) 
Lemma 1.
The deviation in (8) is bounded as:
To balance the two terms inside the square root, we can choose $n$ accordingly; then
For the proof see Appendix F.
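The averaging idea behind the SZO estimator can be sketched as follows. This is our illustration with assumed names and constants; the noise term of each coordinate scales like sigma/(nu*sqrt(n)), while the smoothness bias stays of order M*nu/2, which is why n is tied to nu.

```python
import numpy as np

# Sketch of a noisy-oracle (SZO) estimator: average n noisy evaluations
# at x and at each x + nu*e_j before forming the finite difference, so
# that the noise contribution is damped alongside the smoothness bias.

def szo_gradient(noisy_f, x, nu, n, rng):
    d = x.size
    f0 = np.mean([noisy_f(x, rng) for _ in range(n)])
    G = np.empty(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = nu
        fj = np.mean([noisy_f(x + e, rng) for _ in range(n)])
        G[j] = (fj - f0) / nu
    return G

rng = np.random.default_rng(1)
sigma = 0.01
noisy = lambda x, rng: 0.5 * np.dot(x, x) + sigma * rng.standard_normal()
x = np.array([1.0, -2.0])
G = szo_gradient(noisy, x, nu=0.1, n=400, rng=rng)   # true gradient is x
```

With these (assumed) parameters the estimate carries a bias of about nu/2 per coordinate plus a small averaged-noise term.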
4 Algorithm
Having constructed the gradient estimators for both the exact and the noisy oracle, we now present our safe oracle-based optimization algorithm.
4.1 Safe zeroth order log barrier algorithm for EZO.
4.2 Safe zeroth order log barrier algorithm for SZO.
In the case of a noisy zeroth order oracle, we construct an upper confidence bound based on the measurements taken around the iterates during the algorithm's run.
Lemma 2.
where and
(14) 
For the proof see Appendix D.
We construct an estimator of it as:
(15) 
An upper confidence bound on is then
(16) 
The algorithm for SZO is as follows:
5 Safety and Convergence
5.1 Safety of 0LBM with EZO
To guarantee the safety of the iterates, the step size of the barrier gradient step is restricted using the Lipschitz continuity of the constraints.
Lemma 3.
The measurements taken around the current iterate at Step 4 of 0LBM are feasible. Moreover, if the step size is such that , then , i.e., the next iterate is feasible.
Proof.
For any point $y$ satisfying $\|y - x\|_2 \le \min_i \frac{-f^i(x)}{L_i}$ we have, by Lipschitz continuity, $f^i(y) \le f^i(x) + L_i \|y - x\|_2 \le 0$ for every $i = 1, \dots, m$.
The statement of the lemma follows directly. ∎
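The Lipschitz-based safety radius in this argument is easy to sketch in code. The constraint, constants, and helper names below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Sketch of the safety radius: if each constraint fi is L_i-Lipschitz and
# fi(x) < 0, then any y with ||y - x|| <= min_i (-fi(x)) / L_i satisfies
# fi(y) <= fi(x) + L_i * ||y - x|| <= 0, i.e., y is still feasible.

def safe_radius(x, fis, lips):
    vals = np.array([fi(x) for fi in fis])
    assert np.all(vals < 0), "current iterate must be strictly feasible"
    return np.min(-vals / np.asarray(lips))

def safe_step(x, direction, fis, lips):
    r = safe_radius(x, fis, lips)
    norm = np.linalg.norm(direction)
    gamma = min(1.0, r / norm) if norm > 0 else 0.0   # cap step length at r
    return x - gamma * direction

# Toy constraint f1(x) = x0 + x1 - 1, which is sqrt(2)-Lipschitz in l2.
f1 = lambda x: x[0] + x[1] - 1.0
x_new = safe_step(np.zeros(2), np.array([-1.0, -1.0]), [f1], [np.sqrt(2)])
assert f1(x_new) <= 1e-12            # the capped step cannot leave the set
```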
5.2 Safety of s0LBM with SZO
From Step 5 of the algorithm recall that and from Lemma 2 that Thus, directly from Lemma 3 we obtain the following result.
Lemma 4.
For any generated by s0LBM with probability it is true that . Moreover, given the measurements around the next point at Step 4 of s0LBM are also feasible with probability .
5.3 Convergence of 0LBM and s0LBM
Let us denote by and the errors in the estimates of the gradient of the barrier function for EZO and SZO, respectively. Our approach is to first bound these errors with high probability. Then we construct a bound on the total number of steps and the total number of measurements required to find an approximate scaled KKT point.
EZO: Note that As such, it is easy to see that if , then
(17) 
SZO:
Lemma 5.
If , then
(18) 
For the proof see Appendix E.
Let us next bound the number of iterations of the s0LBM and 0LBM.
Lemma 6.
After iterations of the 0LBM or the s0LBM algorithm with
we obtain that for some , satisfy
For the proof of Lemma 6 see Appendix A. To guarantee that the point above is an approximate scaled KKT point, and need to be upper bounded by . Such a bound can be obtained using (5.3) and (20). Consequently, we can also bound the total number of measurements for 0LBM and s0LBM for convergence to an approximate scaled KKT point.
Theorem 2.
After iterations of 0LBM, there exists an iteration such that is an approximate scaled KKT point. The total number of measurements is
Theorem 3.
After iterations of s0LBM, there exists an iteration such that with probability greater than is a approximate scaled KKT point. The total number of measurements is
Note that in s0LBM the total number of measurements depends on how close the iterates of the algorithm get to the boundary: . For specific cases, we can prove that are bounded from below by , because the barrier gradient direction points away from the boundary. Moreover, in such cases are bounded for all , which means that we obtain an unscaled KKT point.
Corollary 1.
Assume we have only one smooth constraint , . Also, assume that for all close enough to the boundary. Then, for generated by the s0LBM we can guarantee Hence, after iterations of the s0LBM for we find an unscaled approximate KKT point The total number of measurements is
For the proof see Appendix H.
It is also possible to extend this result for the case of multiple constraints, but then more regularity assumptions are needed.
Discussion.
We use two notions for analysing our algorithms: the number of iterations and the number of measurements . The number of steps of the first order method in hinder2019poly is similar to the number of steps of our algorithm. In SafeOpt (berkenkamp2016bayesian) the number of measurements required to achieve accuracy is such that (where is sublinear in ); however, the complexity of each iteration is exponential in the dimension due to the nonconvex optimization problem solved within it. This makes the approach inapplicable to high dimensional problems. The safe Frank-Wolfe algorithm of usmanova2019safe requires measurements for a known convex objective and unknown linear constraints. The convexity of the problem reduces the number of iterations to The fact that the constraints are linear makes local information global, allowing the constraint set to be estimated. In our case, safety under nonconvexity and unknown constraints with noisy 1-point feedback, and convergence with high probability, come at a cost.
6 Experiments
We consider the scenario of a cutting machine (maier2018turning) which has to produce certain tools and optimize the cost of production by tuning the turning process parameters, such as the feed rate and the cutting speed. For the turning process we need to minimize a nonconvex cost function of these process parameters. The constraints include box constraints and a nonconvex surface roughness quality constraint. We perform realistic simulations, using a cost function and constraints estimated from hardware experiments, with artificially added normally distributed noise. The resulting nonconvex smooth optimization problem with concave objective and convex constraints is:
Here , . Note that we assume the box constraints to be known, i.e., not corrupted by noise. However, the roughness constraint and the cost are assumed to be unknown, and we can only measure their values. Hence, this problem is an instance of the safe learning problem formulated in (3). More details are presented by maier2018turning, who proposed to use Bayesian optimization to solve this problem. Although the Bayesian optimization used there indeed requires a small number of measurements, it is not safe and hence may require several measurements to be taken in the unsafe region. The roughness constraint is not fulfilled for unsafe measurements, i.e., tools produced during unsafe experiments cannot be brought to market. That is why safety is necessary for this problem. Although safe Bayesian optimization methods exist, they require strong prior knowledge in the form of a suitable kernel function. We solve the barrier subproblem (6) iteratively using s0LBM with decreasing , where we fix . We set and rescale so that The starting point is In Figure 1a) we show the performance of 0LBM, and in b) we show 20 realizations of s0LBM. In all realizations the method converges to a local optimum and the constraints are never violated.
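The overall pipeline can be sketched end to end on a toy problem. This is our illustration, not the authors' code: all functions, constants, and the step-size rule below are assumptions that combine the finite-difference barrier gradient with a Lipschitz-based safe step.

```python
import numpy as np

# End-to-end toy sketch of a zeroth-order log-barrier descent with
# finite-difference gradients and a Lipschitz-based safe step size,
# mimicking one s0LBM-style barrier subproblem in 2D.

f0 = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 2.0) ** 2   # "unknown" objective
f1 = lambda x: np.dot(x, x) - 1.0                      # "unknown" constraint
L1 = 2.0                                               # Lipschitz constant of f1 on the unit ball

def fd(f, x, nu):
    fx = f(x)
    return np.array([(f(x + nu * np.eye(2)[j]) - fx) / nu for j in range(2)])

eta, nu = 0.1, 1e-4
x = np.zeros(2)
for _ in range(200):
    # Barrier gradient estimate from zeroth-order queries only.
    gB = fd(f0, x, nu) + eta * fd(f1, x, nu) / (-f1(x))
    # Safe radius: within this distance the constraint cannot change sign
    # (shrunk by nu so the finite-difference probes stay feasible too).
    r = (-f1(x)) / L1 - nu
    step = min(0.05, r / (np.linalg.norm(gB) + 1e-12))
    x = x - step * gB
    assert f1(x) < 0                                   # every iterate is feasible

assert f0(x) < f0(np.zeros(2))                         # objective improved
```

Capping the step length by the safe radius is what makes every query and every iterate feasible without ever evaluating the constraint outside the feasible set.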
7 Conclusion
We developed a zeroth order algorithm guaranteeing safety of the iterates and converging to a local stationary point. We provided its convergence analysis, which is comparable to existing zeroth order methods for nonconvex optimization, and demonstrated its performance in a case study.
References
Appendix A
Proof.
First, recall that Note that Recall that the steps are given by
(19) 
where the safe step size is In the paper (hinder2019poly) the authors have shown that
represents a "local" Lipschitz constant of the barrier gradient at the point . In particular, in Lemma 1 of (hinder2019poly) it is shown that for any and . Note that
For the next inequalities we also use the fact that Also we denote by Then, at each iteration of Algorithm 1 we have