Many applications in robotics (schaal2010learning), manufacturing (maier2018turning), health sciences, finance, etc. require minimizing a loss function under constraints and uncertainty. Optimizing a loss function under partially revealed constraints can be further complicated by the fact that observations are available only inside the feasible set. Hence, one needs to carefully choose actions to ensure the feasibility of each iterate while pursuing the optimal solution. In the machine learning community, this problem is known as safe learning. For such tasks, feasible optimization methods are required. There are many first and second order feasible methods in the literature. Although Hessians are hard to estimate with good accuracy given a noisy zero-th order oracle, we can approximate derivatives using finite differences. The best-known and most widely used first order methods for stochastic optimization handle constraints using projections. However, the lack of global knowledge of the constraint functionals makes it impossible to compute the corresponding projection operator.
There is a lack of zero-th order feasible (safe) algorithms for black-box constrained optimization in the literature. balasubramanian2018zeroth provide a comprehensive analysis of the performance of several zero-th order algorithms for non-convex optimization. However, the conditional gradient based algorithm (Frank-Wolfe) of balasubramanian2018zeroth for constrained non-convex problems requires global knowledge of the constraint functions, as the linear objective must be optimized with respect to these constraints. usmanova2019safe introduce a Frank-Wolfe based algorithm for the case of a noisy zero-th order oracle and linear constraints. This work proves the feasibility of the iterates with high probability and bounds the convergence rate with high probability, but it requires convexity.
Non-convex non-smooth problems can be addressed by feasible methods, such as the first order Method of Feasible Directions (topkis1967convergence) or the second order Feasible Sequential Quadratic Programming (FSQP) algorithm (tang2014feasible). Another algorithm for non-convex non-smooth problems is given by facchinei2017feasible. The idea of this algorithm is to move in the direction of the minimizer of a local convex approximation of the objective subject to a local convex approximation of the constraint set. Unfortunately, all the guarantees for the above methods are in terms of asymptotic convergence to a stationary point.
Another class of safe algorithms for global black-box optimization is based on Bayesian Optimization (BO), such as SafeOpt (sui2015safe) and its extensions (berkenkamp2016bayesian). The main drawback of these methods is that the computational complexity grows exponentially with dimensionality.
Interior Point Methods (IPM) are feasible approaches by definition, and they are widely used for Linear Programming, Quadratic Programming, and Conic optimization problems. By using self-concordance properties of specifically chosen barriers and second order information, IPM solve these problems extremely efficiently. However, when the constraints are unknown, building a barrier with self-concordance properties is not possible. In these cases it is possible to use logarithmic barriers for general black-box constraints. hinder2018one propose to choose adaptive step sizes for the gradient algorithm applied to the log barriers and analyse the convergence rate. The work of hinder2018one assumes knowledge of the exact gradients of the cost and constraint functions. In the present work, we extend this approach to the case in which we only have a possibly noisy zero-th order oracle.
In this paper we propose the first safe zero-th order algorithm for non-convex optimization with a black-box noisy oracle. We prove that it generates feasible iterates with high probability, and analyse its convergence rate to a local optimum. Each iteration is computationally very cheap and does not require solving any subproblems such as those required for Frank-Wolfe or Bayesian Optimization based algorithms. In Table 1, we provide a comparison of our algorithm with the existing ones for unconstrained and constrained zero-th order non-convex optimization. In the first two algorithms a multiple point feedback is assumed, i.e., it is possible to measure at several points with the same noise realization. The convergence rate in the second column is proven for known polyhedral constraints. In our algorithm we assume a more realistic and more complicated setup where the noise changes with each measurement. There are some works on zero-th order optimization with 1-point feedback for convex problems with known constraints, for example bach2016highly require measurements to achieve accuracy.
|Problem||Unconstrained||Known constraints||Safe despite unknown constraints|
|Feedback||Noisy 2-point||Noisy 2-point||Noisy 1-point|
|Optimality criterion||Stationary point:|| ||-approximate scaled KKT point:|
|Number of measurements||(no matrix inversion) (Balasubramanian, Ghadimi, 2018)||(conditional gradient based, requires optimization subproblems w.r.t. ) (Balasubramanian, Ghadimi, 2018)||(does not require solving subproblems or matrix inversions)|
2 Problem Statement
Notations and definitions.
Let , and denote -norm, -norm and -norm respectively on . A function is called -Lipschitz continuous if
It is called -smooth if the gradients are -Lipschitz continuous, i.e.,
which implies that .
We consider the problem of non-convex safe learning defined as a constrained optimization problem
where the objective function and the constraints are unknown continuous functions, and can only be accessed at feasible points . We denote by the feasible set
Let . The objective and the constraint functions for are -smooth on . Also, constraint functions for are -Lipschitz continuous on .
The feasible set has a non-empty interior, and there exists a known starting point for which for
We consider zero-th order oracle information. If we can measure the function values exactly, then one possible oracle is the Exact Zero-th Order oracle (EZO).
EZO: provides the exact values for any requested point
In many applications, the measurements of the functions are noisy. We assume that the additive noise comes from a zero-mean -sub-Gaussian distribution. We call this oracle the Stochastic Zero-th Order oracle (SZO).
SZO: provides the noisy function values , is a zero-mean -sub-Gaussian noise independent of previous measurements for any requested point . We assume that the noise values are independent over time and indices .
The condition is usually used as an optimality criterion in non-convex smooth optimization without constraints. It is well known that in the unconstrained case the classical gradient descent method converges with rate which matches the lower bound derived for this class of problems (carmon2017lower). In the non-convex constrained optimization, first order criteria are Karush-Kuhn-Tucker (KKT) conditions, which are necessary in the presence of regularity conditions called Constraint Qualification (CQ). In such cases, we can measure the solution accuracy by satisfying approximate KKT conditions. The point is called a KKT point if it satisfies the necessary condition for local optimality:
is the vector of dual variables and is the Lagrangian. There are several ways to define an approximate KKT point, see (cartis2014complexity; birgin2016evaluation). Similar to cartis2014complexity, we define an approximate KKT point as follows. An -approximate scaled KKT point (s-KKT) for some is a pair which satisfies:
Note that the approximation lies in substituting the equality constraints of the standard KKT with inequalities (s-KKT.2), (s-KKT.3). In case can be uniformly bounded by some constant , the scaled -approximate KKT point is an unscaled -approximate KKT point.
3.1 Log barrier algorithm.
We address the safe learning problem using the log barriers approach. The log barrier function and its gradient are defined as follows:
The main idea of the log barrier methods is to solve a sequence of the barrier subproblems
with decreasing for some For the rest of the paper we fix to be constant, since we propose an algorithm to solve the subproblems, which is challenging given only noisy function evaluations. Let us set the pair of primal and dual variables to be , where for Then, one can verify the following properties:
Now recall (s-KKT.1)-(s-KKT.3) and suppose that is such that Then, one can verify that the pair is an -approximate scaled KKT point.
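The relation between the barrier gradient and the approximate KKT conditions can be checked numerically. The sketch below is only an illustration under stated assumptions: constraints are written as f_i(x) <= 0, the barrier is the standard B_eta(x) = f_0(x) - eta * sum_i log(-f_i(x)), and the dual estimates are taken as eta / (-f_i(x)), which is the usual log barrier construction. The toy objective, the toy constraint and all function names are hypothetical.

```python
import numpy as np

def barrier_and_duals(f0, grad_f0, cons, cons_grads, eta, x):
    """Return B_eta(x), its gradient, the dual estimates and the slacks.

    Assumes constraints of the form f_i(x) <= 0 and a strictly feasible x.
    """
    slacks = np.array([-fi(x) for fi in cons])      # -f_i(x) > 0 inside the set
    if np.any(slacks <= 0):
        raise ValueError("x must be strictly feasible")
    value = f0(x) - eta * np.sum(np.log(slacks))
    lam = eta / slacks                              # assumed dual estimates
    jac = np.array([g(x) for g in cons_grads])      # shape (m, d)
    grad = grad_f0(x) + lam @ jac                   # coincides with grad_x L(x, lam)
    return value, grad, lam, slacks

# Hypothetical toy problem: linear cost on the unit ball.
f0 = lambda x: x[0] + x[1]
grad_f0 = lambda x: np.array([1.0, 1.0])
f1 = lambda x: x @ x - 1.0
grad_f1 = lambda x: 2.0 * x

eta = 0.1
x = np.array([-0.5, -0.5])
value, grad, lam, slacks = barrier_and_duals(f0, grad_f0, [f1], [grad_f1], eta, x)

# Complementarity holds with equality: lam_i * (-f_i(x)) = eta for every i.
complementarity = lam * slacks
```

With this choice of dual variables the complementarity products equal eta by construction, so the only remaining quantity to control is the norm of the barrier gradient, which is what the gradient-type method targets.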
First order log barrier algorithm (hinder2019poly).
For structured problems like conic optimization with known self-concordant barriers and computable Hessians, to solve the log barrier subproblems (6) barrier methods classically use the Newton algorithm. However, since we assume that the structure is unknown, and the second order information is inaccessible, we would like to use gradient descent type methods with the step direction to find a solution of the barrier subproblem (6). The main drawback of solving and analysing the log barrier subproblems using gradient methods is that the log barriers themselves are non-Lipschitz continuous and non-smooth functions since their gradients grow to infinity on the boundary. This might lead to unstable behaviour of the gradient based algorithm close to the boundary, and the step sizes have to be chosen exponentially small. To handle this drawback, in the paper (hinder2019poly) the authors proposed to choose an adaptive step size , where represents a local Lipschitz constant of at the point . The convergence rate for finding the solution of the subproblem , i.e., convergence to an -approximate KKT point using their algorithm with adaptive step sizes is formulated in Theorem 1.
(Claim 2. (hinder2019poly)) Under Assumption 1 for any constant , after at most iterations such that
the procedure finds a point such that
hinder2019poly were the first to derive such rates for first order log barrier methods for general -Lipschitz, -smooth functions, which was made possible by their adaptive step size. This choice allowed them to define and use a local Lipschitz constant of the barrier gradients in their analysis.
Idea of our algorithm.
We extend the first order algorithm of hinder2019poly to the zero-th order oracle case. To this end, we propose to estimate the gradients from zero-th order information using finite differences. Based on these estimates we can estimate the barrier gradients. At the same time, we ensure that the measurements are taken in the safe region despite lack of knowledge of the constraint functions.
3.2 Zero-th order gradient estimation.
We now construct estimators and of the gradient for each , based on the zero-th order information provided by EZO and SZO, respectively. We denote the differences between the estimators and the function gradient by and :
EZO estimator of the gradient.
For the exact oracle, we take measurements around the current point to estimate . Let be the -th coordinate vector. For we can use the following estimator of (see footnote 2):
Footnote 2: Another way is to measure at and at , where is a random direction, and use these values for a stochastic zero-th order estimator. In that case the dependence on dimensionality will be smaller, however the dependence on the confidence level will be much worse: instead of for our algorithm, since it can then be obtained only with multi-starts using the Markov inequality (cf. ghadimi2013stochastic, yu2019zeroth).
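A coordinate-wise finite-difference estimator of this kind can be sketched as follows. This is a minimal illustration, assuming forward differences along the coordinate vectors with a common step `nu`; the function name `ezo_gradient` and the quadratic test function are hypothetical. For a function whose gradient is Lipschitz with constant `M`, each coordinate of a forward difference deviates from the true partial derivative by at most `M * nu / 2`, so the Euclidean norm of the deviation is at most `sqrt(d) * M * nu / 2`.

```python
import numpy as np

def ezo_gradient(F, x, nu):
    """Forward-difference gradient estimate from d + 1 exact evaluations."""
    x = np.asarray(x, dtype=float)
    d = x.size
    fx = F(x)                                   # one evaluation at the centre
    g = np.empty(d)
    for j in range(d):
        e_j = np.zeros(d)
        e_j[j] = 1.0                            # j-th coordinate vector
        g[j] = (F(x + nu * e_j) - fx) / nu      # one evaluation per coordinate
    return g

# Hypothetical smooth test function; its gradient is 6-Lipschitz.
F = lambda x: x[0] ** 2 + 3.0 * x[1] ** 2
x = np.array([1.0, 2.0])
nu = 1e-3
M = 6.0

estimate = ezo_gradient(F, x, nu)
true_gradient = np.array([2.0 * x[0], 6.0 * x[1]])
deviation = np.linalg.norm(estimate - true_gradient)
bound = np.sqrt(x.size) * M * nu / 2.0          # standard forward-difference bound
```

The measurement points `x + nu * e_j` must themselves be feasible, which is why the step `nu` has to be tied to the distance of `x` from the boundary.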
SZO estimator of the gradient.
For the stochastic oracle, we take measurements around the current point . The number of measurements needs to be chosen dependent on since the influence of the noise variance increases with the decreasing distance of measurements from each other. We define the vector of noises where Then, the estimator with is given by
The deviation in (8) is bounded as:
To balance the above two terms inside the square root, we can choose , then
For the proof see Appendix F.
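The trade-off between the finite-difference step and the number of repeated measurements can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the estimator analysed above: it averages `n` independent noisy evaluations at the centre and at each shifted point, uses Gaussian noise as one example of a sub-Gaussian distribution, and the names `make_szo` and `szo_gradient` are hypothetical. The noise contribution to each coordinate scales like `sigma / (nu * sqrt(n))`, so a smaller `nu` requires a larger `n`.

```python
import numpy as np

def make_szo(F, sigma, rng):
    """Noisy oracle: every call draws a fresh, independent noise value."""
    def oracle(x):
        return F(x) + sigma * rng.standard_normal()
    return oracle

def szo_gradient(oracle, x, nu, n):
    """Averaged forward-difference estimate from n * (d + 1) noisy evaluations."""
    x = np.asarray(x, dtype=float)
    d = x.size
    centre = np.mean([oracle(x) for _ in range(n)])
    g = np.empty(d)
    for j in range(d):
        e_j = np.zeros(d)
        e_j[j] = 1.0
        shifted = np.mean([oracle(x + nu * e_j) for _ in range(n)])
        g[j] = (shifted - centre) / nu
    return g

# Hypothetical test function and noise level.
F = lambda x: x[0] ** 2 + 3.0 * x[1] ** 2
rng = np.random.default_rng(0)
oracle = make_szo(F, sigma=0.01, rng=rng)

x = np.array([1.0, 2.0])
estimate = szo_gradient(oracle, x, nu=0.05, n=400)
true_gradient = np.array([2.0 * x[0], 6.0 * x[1]])
error = np.linalg.norm(estimate - true_gradient)
```

In this toy setting the bias from the finite difference (of order `nu`) dominates the averaged noise term, which is the regime the balancing choice of the number of measurements aims for.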
Having computed the gradient estimators for both exact and noisy oracles, we devise our proposed safe oracle-based optimization algorithm.
4.1 Safe zero-th order log barrier algorithm for EZO.
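For orientation, a loop of this type can be sketched as below. This is a hedged illustration in the spirit of the method, not the paper's algorithm listing: the finite-difference step and the gradient step are both capped at half of the Lipschitz-based safe radius, and the local smoothness estimate `M + eta * sum(2M/slack + 4L^2/slack^2)` is a conservative bound derived for this sketch only. The function names, the toy problem and the constants `L`, `M` are assumptions.

```python
import numpy as np

def fd_gradient(F, x, nu):
    """Forward differences along the coordinate vectors (exact oracle)."""
    fx = F(x)
    return np.array([(F(x + nu * e) - fx) / nu for e in np.eye(x.size)])

def zeroth_order_log_barrier(f0, cons, x0, eta, L, M, n_iters=300, nu_max=1e-3):
    """Gradient-type descent on the log barrier using function values only.

    Assumes constraints f_i(x) <= 0 that are L-Lipschitz, and that f0 and
    all f_i are M-smooth on the feasible set.
    """
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_iters):
        slacks = np.array([-fi(x) for fi in cons])
        # Measurement points x + nu * e_j stay feasible if nu <= slack / (2L).
        nu = min(nu_max, 0.5 * slacks.min() / L)
        g0 = fd_gradient(f0, x, nu)
        gc = np.array([fd_gradient(fi, x, nu) for fi in cons])
        g = g0 + (eta / slacks) @ gc            # estimated barrier gradient
        g_norm = np.linalg.norm(g)
        if g_norm < 1e-12:
            break
        # Conservative local smoothness of the barrier (illustrative bound).
        local_smooth = M + eta * np.sum(2 * M / slacks + 4 * L ** 2 / slacks ** 2)
        # Step is capped so that no constraint slack shrinks by more than half.
        gamma = min(1.0 / local_smooth, 0.5 * slacks.min() / (L * g_norm))
        x = x - gamma * g
        path.append(x.copy())
    return x, np.array(path)

# Hypothetical toy problem: linear cost on the unit ball, started at the centre.
f0 = lambda x: x[0] + x[1]
f1 = lambda x: x @ x - 1.0
x_final, path = zeroth_order_log_barrier(f0, [f1], [0.0, 0.0], eta=0.1, L=2.0, M=2.0)
```

On this toy problem every iterate remains strictly inside the unit ball, and the iterates approach the minimizer of the barrier subproblem on the diagonal, close to the boundary point that minimizes the linear cost.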
4.2 Safe zero-th order log barrier algorithm for SZO.
In the case of a noisy zero-th order oracle, we construct the upper confidence bound for based on measurements taken around during the algorithm iterations.
For the proof see Appendix D.
We construct an estimator of as :
An upper confidence bound on is then
The algorithm for SZO is as follows:
5 Safety and Convergence
5.1 Safety of 0-LBM with EZO
To guarantee the safety of , the step size for barrier gradient step is restricted using Lipschitzness of the constraints.
For the measurements taken around at Step 4 of 0-LBM are feasible. Moreover, if the step size for step is such that then , i.e., is feasible.
For any point satisfying we have
The statement of the lemma follows directly. ∎
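The Lipschitz argument behind this kind of safety guarantee can be checked on a small example. The sketch below is illustrative only: it assumes constraints of the form f_i(x) <= 0 with a common Lipschitz constant `L`, and caps the displacement at half of the distance guaranteed by the Lipschitz bound, so that the constraint value after the step is at most half of its current (negative) value. The names `safe_radius` and `safe_step_size` and the factor 1/2 are assumptions of this sketch.

```python
import numpy as np

def safe_radius(constraint_values, L, shrink=0.5):
    """Largest displacement that keeps f_i <= (1 - shrink) * f_i(x) for all i."""
    slacks = -np.asarray(constraint_values, dtype=float)
    if np.any(slacks <= 0):
        raise ValueError("current point must be strictly feasible")
    return shrink * slacks.min() / L

def safe_step_size(constraint_values, L, direction, shrink=0.5):
    """Step size gamma such that gamma * ||direction|| <= safe radius."""
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return np.inf
    return safe_radius(constraint_values, L, shrink) / norm

# Hypothetical constraint: unit ball, whose gradient norm is at most 2 inside it.
f1 = lambda x: x @ x - 1.0
L = 2.0

x = np.array([0.6, 0.0])
direction = np.array([-1.0, 0.0])           # moving along -direction heads outward
gamma = safe_step_size([f1(x)], L, direction)
x_next = x - gamma * direction

value_before = f1(x)
value_after = f1(x_next)
```

Since f_i(y) <= f_i(x) + L * ||y - x|| for an L-Lipschitz constraint, a displacement of at most -f_i(x) / (2L) yields f_i(y) <= f_i(x) / 2 < 0, even when the step points directly towards the boundary as in this example.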
5.2 Safety of s0-LBM with SZO
For any generated by s0-LBM with probability it is true that . Moreover, given the measurements around the next point at Step 4 of s0-LBM are also feasible with probability .
5.3 Convergence of 0-LBM and s0-LBM
Let us denote and , as the errors in the estimate of the gradient of the barrier functions for the EZO and SZO, respectively. Let Our approach is to first bound , and with high probability. Then we can construct a bound on the total number of steps and total number of measurements, required to find an -approximate scaled KKT point.
EZO: Note that As such, it is easy to see that if , then
If , then
For the proof see Appendix E.
Let us next bound the number of iterations of the s0-LBM and 0-LBM.
After iterations of the 0-LBM or the s0-LBM algorithm with
we obtain that for some , satisfy
For the proof of Lemma 6 see Appendix A. To guarantee that the point above is an approximate scaled KKT point, and need to be upper bounded by . Such a bound can be obtained using (5.3) and (20). Consequently, we can also bound the total number of measurements for 0-LBM and s0-LBM for convergence to an approximate scaled KKT point.
After iterations of 0-LBM, there exists an iteration such that is an -approximate scaled KKT point. The total number of measurements is
After iterations of s0-LBM, there exists an iteration such that with probability greater than is a -approximate scaled KKT point. The total number of measurements is
Note that in s0-LBM, the total number of measurements depends on how close the iterates of the algorithm get to the boundary: . For specific cases, we can prove that are bounded from below by , because the barrier gradient direction points away from the boundary. Moreover, for such cases are bounded for all , which means that we get an unscaled KKT point.
Assume we have only one smooth constraint , . Also, assume that for all close enough to the boundary. Then, for generated by the s0-LBM we can guarantee Hence, after iterations of the s0-LBM for we find an unscaled -approximate KKT point The total number of measurements is
For the proof see Appendix H.
It is also possible to extend this result for the case of multiple constraints, but then more regularity assumptions are needed.
We use two notions for analysing our algorithms: the number of iterations and the number of measurements . The number of steps of the first order method in hinder2019poly is similar to the number of steps in our algorithm. In SafeOpt (berkenkamp2016bayesian) the number of measurements is such that (where is sub-linear in ), to achieve the accuracy ; however, the complexity of each iteration is exponential in the dimensionality, since each iteration solves a non-convex optimization problem. This makes the approach inapplicable to high dimensional problems. The Safe Frank-Wolfe algorithm of usmanova2019safe requires measurements for a known convex objective and unknown linear constraints. The convexity of the problem reduces the number of iterations to The fact that the constraints are linear makes local information global when estimating the constraint set. In our case, safety under non-convexity, unknown constraints with noisy 1-point feedback, and convergence with high probability come at a cost.
We consider the scenario of a cutting machine (maier2018turning) which has to produce certain tools and optimize the cost of production by tuning the turning process parameters such as the feed rate and the cutting speed . For the turning process we need to minimize a non-convex cost function , where the decision variable is . The constraints include box constraints and a non-convex quality roughness constraint . We perform realistic simulations by using the cost function and constraints estimated from hardware experiments, with artificially added normally distributed noise. The obtained non-convex smooth optimization problem with concave objective and convex constraints is:
Here , . Note that we assume the box constraints to be known, i.e., not corrupted with noise. However, the roughness constraint and the cost are assumed to be unknown and we can only measure their values. Hence, this problem is an instance of the safe learning problem formulated in (2). More details are presented by maier2018turning, who proposed to use Bayesian optimization to solve the problem. Although the Bayesian optimization used there indeed requires a small number of measurements, it is not safe and hence may require several measurements to be taken in the unsafe region. The roughness constraints are not fulfilled for unsafe measurements, i.e., the tools produced during unsafe experiments could not be sold on the market. That is why safety is necessary for this problem. Although there exist safe Bayesian optimization methods, they require strong prior knowledge in the form of a suitable kernel function. We solve the barrier sub-problem (6) iteratively times using s0-LBM with decreasing , where we fix . We set and re-scaled so that The starting point is In Figure 1a) we show the performance of 0-LBM, and in Figure 1b) we show 20 realizations of s0-LBM. In all the realizations the method converges to a local optimum and the constraints are not violated.
We developed a zero-th order algorithm guaranteeing safety of the iterates and converging to a local stationary point. We provided its convergence analysis, which is comparable to existing zero-th order methods for non-convex optimization, and demonstrated its performance in a case study.
First, recall that Note that Recall that the steps are given by
where the safe step size is In the paper (hinder2019poly) the authors have shown that
represents a "local" Lipschitz constant of at the point . In particular, in Lemma 1 (hinder2019poly) the authors have shown that for any and . Note that
For the next inequalities we also use the fact that Also we denote by Then, at each iteration of Algorithm 1 we have