When learning a policy on a robot, some roll-outs often result in failure behaviors that prevent the robot from completing the task. In those cases, the system can largely (and often quickly) diverge from the desired behavior, necessitating premature termination of the experiment, for example, by pressing an emergency button. It is then unclear what cost should be assigned to such a failure. Intuitively, the designer would heuristically choose a high penalty so that similar policies are never visited again. For example, Marco et al. (2016) propose a method to automatically tune the controller parameters of a humanoid robot that learns to balance an inverted pole. Therein, unsuccessful balancing experiments are penalized with a large number, chosen heuristically a priori. In (Rai et al., 2017), a walking controller of a bipedal robot is learned from data, using a tailored cost function that assigns a high cost to roll-outs in which the robot falls down. Tu and Recht (2018) learn the parameters of an LQR feedback policy from data in a reinforcement learning (RL) setting, and assign an infinite cost when the policy destabilizes the closed loop. Finally, in (Kober and Peters, 2009), RL is used for the ball-in-a-cup problem, assigning a constant cost to roll-outs in which the ball does not pass upward over the rim of the cup.
While these methods show successful learning experiments, choosing appropriate penalties for failure cases is a key component of the reward design, and unclear a priori (Chatzilygeroudis et al., 2018). In failure cases (e.g., unstable controllers), deciding on a penalty is non-trivial, because:
The immediate cost values might be orders of magnitude too large and not comparable against those of the stable policies.
Due to the premature experiment termination, the collected data can be too scarce, and not comparable with that of the stable policies, whose experiments lasted longer (Marco et al., 2016).
After stopping the experiment, sensor measurements could be difficult, or even impossible, to access.
To circumvent these problems, the cost designer typically ignores the data acquired at failure cases and commits to ad hoc choices, which are decided and fixed before doing any experiment. While this might sound reasonable, prior to any data collection and without expert knowledge about the system, the designer does not know how to choose such a penalty. Furthermore, arbitrary choices come at the risk of laborious and repeated manual tuning: during learning it is possible that stable roll-outs yield a cost worse than the arbitrary penalty chosen a priori, which forces a re-design of the cost function, and thus, possibly a restart of the learning experiments. Generally speaking, choosing an appropriate penalty for failures is problem-dependent and non-trivial. Moreover, failure cases need to be treated separately when designing the cost function, typically requiring manual tuning effort.
In the context of robot learning, Bayesian optimization (BO) (Shahriari et al., 2016) can be seen as a form of RL for policy optimization. Therein, the latent cost function represents an unknown mapping from the parameters of a policy executed on the robot to its performance on completing some task (von Rohr et al., 2018; Rai et al., 2017; Marco et al., 2016; Calandra et al., 2016; Berkenkamp et al., 2016). Such a cost is typically modeled in a probabilistic setting and minimized iteratively through experiments. BO incorporates a probabilistic prior on the objective cost, typically through a Gaussian process (GP) regression model (Rasmussen and Williams, 2006). This is a non-parametric model that allows for inference on the latent cost at unobserved locations, conditioned on collected data.
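As an illustrative sketch of such inference (a generic zero-mean GP regressor with a squared-exponential kernel, standing in for the model developed later; all names are ours), the posterior mean and variance at unobserved locations follow from standard Gaussian conditioning:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row-wise input sets A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """Posterior mean and variance of a zero-mean GP at test inputs Xs,
    conditioned on noisy observations y at training inputs X."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha                                          # predictive mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf_kernel(Xs, Xs)) - np.sum(v**2, axis=0)   # predictive variance
    return mu, var
```

Near a training point the posterior mean approaches the observed value, while far away it reverts to the zero prior with full prior variance.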
The core contribution of this paper is a novel Bayesian model for the cost function, such that the penalty term does not need to be defined a priori, but is instead learned from data. The likelihood of the Bayesian model can handle a hybrid dataset, in which stable policies yield a continuous-valued cost observation, while unstable policies yield only the discrete label “unstable”. The resulting intractable posterior is approximated to obtain a tractable GP model. In addition, the GP model is used within BO to reveal the optimal policy by sequentially selecting experiments that avoid unstable areas with high probability. Because the resulting model serves both as a probabilistic regressor and as a classifier of stable and unstable regions, we call it Gaussian process for classified regression (GPCR). We illustrate the problem of learning with failures with a case example. Consider tuning the policy parameters of a humanoid robot doing a simple task, like squatting. The goal is to find the parameters that minimize the tracking error of the movement. This error is measured through a cost function that is evaluated after each experiment. During learning, we observe that some controllers make the robot fall down, and others cause large vibrations in the joints. Since those cases are undesired, and considered unstable, an operator prematurely stops the experiment by pressing an emergency button. We consider all these cases as failures, for which no meaningful data is available.
In the context of Sec. 1, the unstable cases can be seen as a consequence of two external constraints being violated, i.e., the velocity of the joints and the tilting angle of the torso surpassing certain thresholds. If such thresholds are known, one can tell during learning how close a specific controller is to being unstable by measuring the sensor signals. However, prior to any data collection, the designer does not know what those thresholds are, i.e., neither the largest joint velocities right before vibrations start to occur, nor the maximum angular position of the torso before the robot falls down. The proposed GPCR model can also be used to model such constraints when the constraint thresholds are unknown. Furthermore, GPCR estimates such unknown thresholds from data. This is the second core contribution of the paper.
In addition, the resulting constrained problem is addressed using Bayesian optimization under the presence of unknown constraints (BOC). To this end, recent BOC methods motivated by information theory (Hernández-Lobato et al., 2016) have been shown to outperform existing heuristics, but they are computationally demanding. Instead, we extend a recent information-theoretic Bayesian optimizer, Max-Value Entropy Search (Wang and Jegelka, 2017), which is computationally cheaper, to account for unknown constraints. The resulting algorithm is called Min-Value Entropy Search with unknown level-set constraints (mESCO). That is the third contribution of this work.
The paper is structured as follows: In Sec. 2, we formulate the constrained controller tuning problem, and define the key mathematical elements of the paper. In Sec. 3, we present the novel Bayesian model (GPCR) in the context of unknown failure penalties. In Sec. 4, a range of applications of the GPCR model are presented, including the extension to model black-box constraints, and especially the case in which constraint thresholds are unknown. In Sec. 5, Bayesian optimization with unknown constraints is explained, along with our extension (mESCO) of Max-Value Entropy Search. In Sec. 6, we discuss our contributions in the context of the existing literature. In Sec. 7, we demonstrate the benefits of GPCR on several numerical simulations and experiments on a real robotic platform, and we conclude with a discussion in Sec. 8.
In this section, we start with a general formulation of the controller tuning problem. We then define mathematically the unknown instability threshold, a key parameter of the GPCR model, and finally introduce notation for Gaussian processes (GPs).
2.1 Controller tuning problem
Let a system be defined by unknown discrete-time dynamics , where is the state, is the control signal, and is process noise, at time . We assume the control signal is obtained from a feedback policy . Generally, we are interested in policies that make the system accomplish a specific user-defined task, for example, picking an object with a robot arm and placing it somewhere else on the table. Such policies drive the system from the initial state to a terminal state , describing a time trajectory as a sequence of states and control inputs . The cost of the system for not accomplishing the desired task is specified via a cost function . The cost of a full trajectory under the policy is then defined as . While the true system dynamics are unknown, we assume access to an approximate model . In cases in which the model and the cost have certain structure, it is possible to compute an optimal policy
For example, when the model is linear and the cost is quadratic, the optimal policy is the linear quadratic regulator (LQR) (Anderson and Moore, 1990). Generally speaking, while such a policy is optimal for the model , it is suboptimal for the true dynamics . In order to recover the optimal policy for , we parametrize with parameters and learn the optimal parameters by collecting experimental data from the true system . To this end, we obtain a different cost of a full trajectory for each parametrization . This induces an objective , which we write as to simplify notation. Then, the learning problem is to find the parametrization that minimizes the trajectory cost
on the true dynamics . The function is unknown, as it depends on unknown dynamics , but we can query it as a black-box function at location by doing an experiment on the real system. For instance, in (Marco et al., 2016) the weighting matrices of an LQR controller design are parametrized with parameters . Therein, the goal is to minimize an unknown cost objective that quantifies the performance of a real robot balancing an inverted pole. Thus, querying the function at location consists of doing an experiment with policy and collecting sensor data to compute the corresponding cost value .
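The black-box querying can be mimicked on a toy example. The sketch below (a hypothetical scalar linear system with quadratic cost, not any of the cited robot experiments) treats one roll-out as a query that returns either a cost or only a failure label when the “experiment” is stopped early:

```python
import numpy as np

def rollout_cost(theta, a=1.2, b=1.0, x0=1.0, T=50, blowup=1e3):
    """Run the closed loop x_{t+1} = a*x_t + b*u_t with policy u_t = -theta*x_t.
    Returns (cost, stable): the accumulated quadratic cost, or stable=False if
    the state diverges and the 'experiment' is terminated prematurely."""
    x, cost = x0, 0.0
    for _ in range(T):
        u = -theta * x
        cost += x**2 + u**2
        x = a * x + b * u
        if abs(x) > blowup:          # premature experiment termination
            return None, False       # no meaningful cost is recorded
    return cost, True
```

Here the closed loop is a - b*theta, so theta = 1.2 stabilizes the system and yields a finite cost, whereas theta = 0 lets the state diverge and triggers the premature stop.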
2.2 Instability threshold as a level-set constraint
In the RL literature, the data collected with an unstable controller is typically discarded, and a relatively large value is heuristically chosen to characterize its performance. While we also propose to discard such data, we never choose a heuristic penalty. These ideas can be mathematically reflected in the objective as follows.
Let us divide the domain into two sets: the set of stable controllers , for which function values are known, and the set of unstable controllers , for which function values are somewhat large, but unknown, with . To include these concepts in the objective , we re-define it so that its image is
where is lower bounded by the global minimum , and the image of the objective over the stable set is upper bounded by a threshold , which is defined next.
The instability threshold is the worst possible cost excluding those of the unstable set .
In the context of Sec. 1, the instability threshold can be seen as the worst possible squat one can observe without destabilizing the robot. From (3), it follows that the threshold induces a level set , with . In addition, we say that imposes a level-set constraint in that divides the input space into the stable set and the unstable set . Finally, since the values of that correspond to stable controllers are “constrained” to live below , we say that the function is self-constrained.
It is important to remark that, without complete knowledge of the function , and prior to any data collection (i.e., never having performed an experiment on the system), is unknown to the designer. However, in Sec. 3.4 we show that can be estimated from data, and that this estimate is useful for targeting stable regions when searching for the optimum (2).
We have introduced the instability threshold , a key concept that we use to characterize instability in the context of the image of the objective . This concept is important, as it will be recurrently used in the remainder of the paper.
2.3 Gaussian process (GP)
A GP is a probability distribution over functions, such that every finite restriction to random function values follows a multivariate Gaussian distribution. That means the objective function is modeled as , with prior mean , and positive definite covariance function (also called kernel) . The kernel encodes the strength of statistical correlation between two latent function values and , where . The model proposed in Sec. 3 assumes a zero-mean prior, .
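For concreteness, a possible kernel choice is an ARD Matérn kernel, as used later in the paper's illustrations (the smoothness order 5/2 below is our assumption; the excerpt does not specify it):

```python
import numpy as np

def matern52_ard(A, B, lengthscales, variance=1.0):
    """ARD Matern-5/2 kernel between row-wise inputs A (n,d) and B (m,d):
    each input dimension is scaled by its own lengthscale."""
    ls = np.asarray(lengthscales, dtype=float)
    diff = (A[:, None, :] - B[None, :, :]) / ls      # (n, m, d), broadcasts over ls
    r = np.sqrt(np.sum(diff**2, axis=-1))            # scaled Euclidean distance
    s = np.sqrt(5.0) * r
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

On the diagonal the kernel equals the signal variance, and the correlation decays with the lengthscale-weighted distance between inputs.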
In the remainder of this paper we will refer to controller parametrization simply as controller . In addition, we will refer to a stable/unstable controller as stable/unstable .
3 Gaussian process model for classified regression (GPCR)
In the following, we present the Bayesian model for classified regression and how to approximate it as a Gaussian process. The model targets the objective function from (2), and inherently encodes the distinction between unstable and stable controllers through the instability threshold defined in Sec. 2.2.
3.1 Likelihood model for unknown penalties
Let us assume that for each parameter we have access to two types of observations: (i) a binary label that determines whether the experiment was unstable or stable, and (ii) a noisy cost value only when the experiment is stable:
where we have used the shorthand notation , , and . As stated in Sec. 2.2, we assume that all stable controller parametrizations yield noise-corrupted observations , with . In the context of the problem described in Sec. 2.1, the observation noise arises from the process noise in the system, and the need to approximate the expectation in (1) in practice (Marco et al., 2016). For unstable controller parametrizations , we assume that no continuous-valued observation is obtained, but only a discrete label.
The graphical model in Fig. 1 expresses the relation between variables. For a given , having full information about is sufficient to determine whether the controller is stable or not (), and whether we have access to a cost value observation or not (). Notice that the noisy evaluation is not fully determined by , but also requires the additional information about the associated label (cf. (4)). This model can be encoded in a likelihood over the observation of the latent function at location as
where the dependency on in the right-hand side is redundant, but has been kept for clarity. is the Heaviside function, i.e., , if , and otherwise. The likelihood (5) captures our knowledge about the latent function: When is unstable, all we know about is that it takes some value above the threshold , with all values being equally likely, but we never specify what that value is. If is stable, all the probability mass falls below for consistency, and the indicator function toggles the contribution of the Gaussian.
Let the latent objective be observed at locations , which comprise both stable and unstable controllers , with . The corresponding latent objective values are given at the stable locations by , and analogously at the unstable locations. We group the latent objective values in a vector. The latent function values induce observations grouped in the set . Since contains both discrete labels and real scalar values, we refer to it as a hybrid set of observations. We assume such observations to be conditionally independent and identically distributed following (5), which allows the likelihood to factorize . Then, the likelihood over the data set becomes
3.2 Posterior probability
The posterior follows from Bayes' theorem
where , , and are the prior covariance matrices expressing the correlation between function evaluations at stable (s) and/or unstable (u) locations. The last two terms are the product of two multivariate Gaussians of different dimensionality. This is equal to another unnormalized Gaussian, whose mean, variance, and scaling factor depend on the observations and the noise (cf. Appendix A). Hence, the posterior becomes
where the scaling factor has been omitted for simplicity. The role of the Heaviside functions is to restrict the support of to an unbounded hyperrectangle with a single corner at location . Thus, (9) can be geometrically understood as the product of a multivariate Gaussian density times an activation function that sets to zero any probability mass outside such hyperrectangle. Computing the right-hand side of (9) is analytically intractable. However, it can be approximated with an unnormalized Gaussian
where , , and are computed following (Cunningham et al., 2011). Therein, the posterior of a multivariate Gaussian whose support is restricted to bounded polyhedra (i.e., a more general case than the one from (9)) is approximated using Expectation propagation (EP) (Minka, 2001). Our implementation of this EP approximation performs comparably to computing the posterior using sampling methods, e.g., Elliptical Slice Sampling (Murray et al., 2010), but we have found EP to require much less wall-clock time to yield the same results.
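The geometry of the truncated posterior can be checked numerically. The sketch below estimates the moments of a Gaussian restricted to such a hyperrectangle by brute-force rejection sampling; it is a low-dimensional sanity check for EP-style approximations, not the EP algorithm itself:

```python
import numpy as np

def truncated_gaussian_moments(mu, Sigma, c_hat, unstable_idx, n=200000, seed=0):
    """Monte-Carlo moments of N(mu, Sigma) restricted to the hyperrectangle
    {f : f_i >= c_hat for each unstable index i}: draw samples and keep only
    those satisfying all threshold constraints."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, Sigma, size=n)
    mask = np.all(samples[:, unstable_idx] >= c_hat, axis=1)
    kept = samples[mask]
    return kept.mean(axis=0), np.cov(kept.T)
```

For a standard 2-D Gaussian truncated to f_1 >= 0, the mean of the constrained component is sqrt(2/pi) ≈ 0.798, which the sample estimate reproduces while the unconstrained component stays near zero.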
3.3 Predictive probability
Our goal is to make inference at an unobserved location , where the predictive latent function value is jointly distributed with the vector of latent function values at the observations, i.e., . The joint posterior is
where the likelihood term does not depend on the unobserved location , nor on . The predictive distribution of conditioned on the collected data can then be obtained by marginalizing over the latent function values from the joint posterior as
where the likelihood term is defined as in (6), and we extend the prior with the unobserved location as
The multivariate integral in (12) is analytically intractable due to the structure of the likelihood (6). However, we can alleviate this by applying some transformations. First, by combining and (7), the integral from (12) can be rewritten as . Then, using the Gaussian approximation (10) of the posterior , the resulting predictive distribution is also Gaussian , with mean and variance as a function of the mean and covariance of the posterior (Rasmussen and Williams, 2006, Sec. 3.4.2). Using the approximated posterior (10), the moments of the predictive distribution are analytically given by
We illustrate in Fig. 2 (top) the approximate posterior for the setting described in (14). A detailed analysis of the quality of the approximation to can be found in Appendix B. Consider the unknown objective , , for which we have three stable evaluations , and two unstable evaluations . We assume small noise , and a zero-mean prior with covariance , whose entries are computed with an ARD Matérn kernel with signal variance and lengthscale .
The univariate distribution is straightforwardly extended to a set of unobserved locations , where we consider the joint prior . This gives rise to a multivariate approximate Gaussian posterior , i.e., a Gaussian process.
The threshold influences the predictive distribution through the likelihood model (6), which induces an intractable posterior that, when approximated as a multivariate Gaussian, yields a Gaussian process model. In addition, the threshold can be seen as a discriminator that distinguishes instability from stability along the axis of the cost value. Because this is close to the idea of classification, but reinterpreted within a regression model, we call this model Gaussian process for classified regression (GPCR).
In the next section, we give detailed insights into the influence of in the model, and show how it can be learned from data.
3.4 Instability threshold as a model parameter
The instability threshold plays an important role in the likelihood model (6). However, prior to observing data, one has in principle no notion of an appropriate value for it. Instead, we can compute an estimate from data using GPCR. Herein, we show two possible ways of computing .
3.4.1 Estimation via maximum likelihood (ML)
One possible way to estimate from data is by maximizing the marginal likelihood , which can be computed by integrating out from the right-hand side of (9). Since that is analytically intractable, we instead use the integral over the approximate posterior (10) to compute its marginal
where the dependency on has been introduced in for clarity. From (10) it follows that . Thus, solving (15) implies performing a line search, for which every call to the function involves solving an EP loop.
3.4.2 Estimation via maximum a posteriori (MAP)
Another possible way of estimating is to consider it a random variable distributed according to a hyperprior , and compute the maximum of the posterior distribution , where we have assumed that and are uncorrelated. Using the approximate posterior (10) over , and Bayes' rule, we obtain the posterior. Due to the lack of prior knowledge about , we propose a wide Gaussian hyperprior, whose parameters are to be selected by the designer. Similar to (15), we find by solving the optimization problem
An advantage of ML over MAP is that the former requires no prior reasoning about at all. However, we have found in our experiments that ML can lead to undesirable estimates of when stable evaluations are absent. In that case, we see from (9) that the posterior probability becomes larger, since the term activates a larger portion of the Gaussian, and has a maximum at . Such an estimate can cause problems when using the model to perform Bayesian optimization under unknown constraints, as explained in Sec. 5. Analogously, when only stable evaluations are present, the maximum is found at . In contrast, MAP benefits from the hyperprior over , which acts as a regularizer that avoids such extreme point estimates.
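As a toy illustration of the MAP estimate (a drastic simplification of the EP-based marginal: one stable and one unstable observation, modeled by independent Gaussians; all names, priors, and the factorized "marginal likelihood" are our assumptions):

```python
import numpy as np
from scipy.stats import norm

def map_threshold(y_stable, noise_std=0.1, prior_mu=2.0, prior_std=1.0,
                  grid=np.linspace(-5, 5, 2001)):
    """Toy MAP estimate of the instability threshold c_hat. Assumes one stable
    observation y_stable (latent ~ N(y_stable, noise_std^2)) and one unstable
    latent with a standard-normal prior; the marginal likelihood of c is
    Pr(f_stable < c) * Pr(f_unstable > c), regularized by a Gaussian
    hyperprior N(prior_mu, prior_std^2) over c."""
    log_lik = norm.logcdf((grid - y_stable) / noise_std) + norm.logsf(grid)
    log_post = log_lik + norm.logpdf(grid, prior_mu, prior_std)
    return grid[np.argmax(log_post)]
```

The hyperprior pulls the estimate toward prior_mu, while the likelihood keeps it above the stable observation and below implausibly large values, avoiding the degenerate extremes discussed above.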
The estimated instability threshold becomes important when determining the probability that a specific unobserved location results in a stable controller. In the next section, we describe how to compute this probability.
3.5 Probabilistic stability constraint in GPCR
Given our current estimate of the instability threshold , we can compute the probability of a specific point being a stable observation, i.e., . From Sec. 2.2, it follows that . After having observed data , we define
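Under the Gaussian predictive distribution of GPCR, with (illustrative) moments mu_x and std_x at a query point, this stability probability reduces to a normal CDF:

```python
from scipy.stats import norm

def prob_stable(mu_x, std_x, c_hat):
    """Pr(f(x) < c_hat) under the Gaussian predictive N(mu_x, std_x**2):
    the probability that the latent cost at x lies below the estimated
    instability threshold, i.e., that x corresponds to a stable controller."""
    return norm.cdf((c_hat - mu_x) / std_x)
```

A point whose predictive mean sits exactly on the threshold is stable with probability 0.5; lowering the predictive mean or raising the threshold moves this probability toward 1.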
4 Applications and extensions of GPCR
The GPCR framework introduced in Sec. 3 can be used to model the objective function from (2), but it is not limited to it. One could think of the class of unstable controllers more broadly, i.e., as any controller that results in any undesired behavior. For example, a robot arm achieving a stable trajectory, but hitting an obstacle on the way. Such undesired behavior can be measured, for example, by computing the Euclidean distance from the robot arm to the obstacle. This information is much richer than a discrete success/failure label and could, for instance, be treated as an additional external black-box function, modeled as a standard GP (by “standard GP” we mean the most common use of a GP, i.e., with a Gaussian likelihood, as opposed to the GPCR model proposed herein), and the same for the objective. In such a case, at first sight it seems like Bayesian optimization under unknown constraints (Gelbart et al., 2014; Gardner et al., 2014) suffices to address the constrained problem and GPCR is not needed. However, there exist many scenarios in which GPCR comes in handy; for example, when the constraint threshold is unknown (i.e., in the aforementioned example with the robot arm, we do not know a priori where the obstacle is). In that case, one can model the constraint using GPCR, and view the unknown constraint threshold as the instability threshold introduced in Sec. 3.4. By doing so, we can also learn the constraint threshold from data.
Including the aforementioned case, we have identified in total four different cases where the GPCR model plays an important role in modeling the constraint, the objective, or both. However, before explaining each case, we need to introduce some new notation and concepts that will make the differences clear. Let us assume the optimization problem (2) is constrained under black-box external constraints, which results in the constrained minimization problem
where are the constraint thresholds. From now on, we assume that evaluations are coupled, i.e., when evaluating at , we obtain measurements from both the objective and all the constraints (Hernández-Lobato et al., 2016). Additionally, we distinguish between “stable” and “safe” controllers: A controller is considered unsafe, but stable, when it violates at least one of the constraints but keeps the system stable (e.g., the robot arm performs a stable trajectory, but finds an obstacle in the way). Similarly, is considered safe, but unstable, when it does not violate any constraint but destabilizes the system (e.g., the robot arm executes a poor controller, which destabilizes the system, but does not hit any obstacle). Consequently, we redefine more broadly as to include both stable and safe controllers, as opposed to the definition from Sec. 2.2. Although the evaluations are coupled “in location”, we assume they are uncoupled “in observation”. For example, an unsafe but stable controller yields a discrete label when evaluating the constraint, but a continuous-valued observation when evaluating the objective.
The nature of the constraints (18) is important when deciding whether the GPCR model is useful or not. More concretely, when it comes to obtaining a query from a constraint , we distinguish two characteristic cases: (i) the query value exists when the constraint is satisfied, and does not exist when it is violated (instead, a binary label is obtained), and (ii) the query value does not exist in either case, and we only obtain a binary outcome that determines constraint satisfaction/violation. In the first case, we characterize as a level-set constraint, while in the second case, we characterize it as a binary constraint. For modeling a level-set constraint, GPCR has benefits over standard GPs, because it can estimate the constraint threshold during learning. For modeling a binary constraint, GPCR has no special benefit, as the dataset is not hybrid (cf. Sec. 3.1), but binary. Given these two clearly distinct groups, let us assume level-set constraints and binary constraints, with .
Using the aforementioned details, we discuss separately the benefits of GPCR in four different scenarios that arise from particular characterizations of the constraint and the objective. The explanations are supported with illustrations (Fig. 3) on a simple one-dimensional setting with one objective function and one constraint. In all cases, we also show the decision-making of the Bayesian optimizer that addresses the constrained minimization problem (18), which will be detailed in Sec. 5. As a quick reference for the reader, we briefly summarize here the four cases:
Single constraint equal to the objective; ; self-constrained objective (Sec. 4.1).
Multiple binary constraints; ; constraints absorbed by the objective (Sec. 4.2).
Multiple level-set constraints; ; objective modeled as a standard GP (Sec. 4.3).
Self-constrained objective, and multiple level-set constraints; and (Sec. 4.4).
4.1 Case 1) Self-constrained objective
The general problem formulation (18) finds its simplest case when the objective is also the constraint, i.e., , with and . This case is equivalent to the one explained in Sec. 2.2: The threshold imposes a level-set constraint on (we say is self-constrained), with stability region . Fig. 2(a) shows a possible situation corresponding to such a scenario. The objective is modeled with GPCR, and the success/failure binary information is captured by it. The true constraint is also shown for clarity, to emphasize that it is identical to the true objective, but it does not play any role. We see that GPCR pushes the probability mass above the current estimate of the instability threshold in areas with unstable evaluations. The first clear benefit of GPCR in this case is that no heuristic cost needs to be assigned to unstable controllers by the designer. The second is that the Bayesian optimizer (bottom plot of Fig. 2(a)) sees those regions as less interesting because the cost is predicted to be high.
4.2 Case 2) Multiple binary constraints
In this case, we assume all constraints to be binary (). Such constraints provide only discrete labels indicating success/failure. Thus, instead of modeling them separately (e.g., using a Gaussian process classifier), their discrete information is directly absorbed into the GPCR model used for the objective. In Fig. 2(b), the region of stability is given by . When evaluating in the middle region, where the constraint is violated, three unstable controllers are obtained. Therefore, the GPCR model pushes the probability mass above . That way, those regions are predicted to have a high cost. The first clear benefit of GPCR in this case is that one model handles both the information of the objective and that of the constraint. The second is that it automatically makes the unstable region less interesting for the Bayesian optimizer, which selects the next evaluation in a stable area. The estimated instability threshold is found above all collected stable points, but in this case it has no meaning beyond its role as a parameter within the GPCR model. It merely serves to push the cost probability above already collected stable evaluations.
4.3 Case 3) Multiple level-set constraints
Herein, we assume all constraints to be level-set constraints (). In Fig. 2(c), the constraint is a level-set constraint, modeled with GPCR, with the stable set determined by , and being unknown a priori. In contrast, the objective function itself is not level-set constrained, and is modeled with a standard GP. The Bayesian optimizer searches for the constrained minimum in areas where the constraint is satisfied with high probability. The estimated instability threshold (orange horizontal line) herein estimates the true constraint threshold (grey horizontal line). The main benefit of GPCR in this case is that does not need to be known by the designer a priori. Instead, it can be estimated while collecting data.
4.4 Case 4) Multiple binary and level-set constraints
Last, we discuss the case in which and .
Generally, we model each of the level-set constraints independently with a GPCR model, and we consider the binary constraints absorbed into the GPCR model for . In total, this comprises models. This is a more general case than the ones before, and combines Sec. 4.2 and Sec. 4.3 in one. Fig. 2(d) shows three main advantages of GPCR in this scenario: (i) the designer does not need to know a priori the thresholds of the level-set constraints, (ii) the binary constraints do not need to be modeled separately, but are rather absorbed by the model for , and (iii) the designer does not need to define a penalty value for .
It is worth remarking that this case is equivalent to having level-set constraints and a self-constrained objective. The reasoning is that, given a self-constrained objective, one can always construct a set of virtual binary constraints, which combined together, divide the space into the same stable/unstable areas implied by the instability threshold of the self-constrained objective.
We have illustrated the benefits of GPCR and its capability to handle multiple constraints in four simple scenarios. However, the question remains how to solve (18) in each of the aforementioned four cases. In the next section, we show how constrained Bayesian optimization can be leveraged to address (18), and explain the nuances of each of the four cases.
5 Constrained Bayesian optimization and GPCR
Standard Bayesian optimization (BO) attempts to solve problem (2) by sequentially deciding which controller shall be evaluated in the next experiment. This decision is made based on a specific criterion, e.g., maximizing information gain about the optimum (Hennig and Schuler, 2012), trading off exploration vs. exploitation (Srinivas et al., 2010), or improving upon the best point observed so far (Jones et al., 1998). For an extensive review of recent BO methods, see (Shahriari et al., 2016). The aforementioned criteria are represented by an acquisition function, typically computed through the Gaussian process model on the objective, updated with the function values revealed up to the current iteration. The maximizer of the acquisition function is used to decide on the next experiment.
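The loop described above can be sketched in a few lines. The following is a minimal, self-contained illustration (not the paper's implementation), using a simple zero-mean GP surrogate on a grid and a lower-confidence-bound acquisition; the kernel, lengthscale, and trade-off parameter are illustrative choices:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # squared-exponential kernel between two sets of 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    # standard zero-mean GP regression posterior on the candidate grid Xs
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xs, X)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y
    var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def bayes_opt(f, n_iter=15, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    Xs = np.linspace(0.0, 1.0, 200)      # candidate controllers
    X = np.array([rng.uniform()])        # initial random controller
    y = np.array([f(X[0])])
    for _ in range(n_iter):
        mu, sd = gp_posterior(X, y, Xs)
        acq = -(mu - beta * sd)          # lower confidence bound (minimization)
        x_next = Xs[np.argmax(acq)]      # maximizer decides the next experiment
        X, y = np.append(X, x_next), np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

x_best, y_best = bayes_opt(lambda x: (x - 0.3) ** 2)
```

The acquisition trades off low posterior mean (exploitation) against high posterior uncertainty (exploration), which is the mechanism the criteria cited above refine in different ways.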
In presence of black-box constraints, Bayesian optimization with unknown constraints (BOC) (Griffiths and Hernández-Lobato, 2017; Gelbart, 2015; Gardner et al., 2014; Snoek, 2013; Gelbart et al., 2014; Schonlau et al., 1998) has been proposed to address problem (18). Therein, the main idea is to target experiments where the constraints are satisfied with high probability.
In the following, we discuss the benefits of GPCR in the context of BOC and BO for the four cases described in Sec. 4. In Sec. 5.1 we show that, when at least one of the constraints is a level-set constraint, we can incorporate it into the optimization problem and address it using constrained Bayesian optimization. Furthermore, in Sec. 5.2 we discuss that, when all constraints are binary (i.e., Sec. 4.2), they can be absorbed into the GPCR model, and thus standard BO suffices to solve the constrained optimization problem.
Although existing BOC methods have proven successful in practice (Gelbart et al., 2014), the underlying decision-makers are heuristics developed based on intuition. Alternatively, entropy-based methods (Hennig and Schuler, 2012), built on the principle of information gain about the optimum, have proven to outperform the aforementioned heuristics. However, entropy-based methods able to handle multiple constraints are scarce and computationally expensive; to the best of our knowledge, only (Hernández-Lobato et al., 2016) does so. Therefore, we propose to extend a recent entropy-based criterion (Wang and Jegelka, 2017), which is computationally cheaper than (Hernández-Lobato et al., 2016), to account for unknown constraints.
5.1 Bayesian optimization with unknown level-set constraints
Sec. 4.3 and Sec. 4.4 illustrate the case in which at least one of the constraints of (18) is a level-set constraint, modeled with GPCR. Since this model describes the constraint probabilistically, we cannot guarantee that the constraint will be satisfied/violated in locations where observations have not been collected. Furthermore, even at locations where the constraint is satisfied and a continuous-valued observation is acquired, we are unable to guarantee constraint satisfaction/violation, because the observation is contaminated with noise. As a result, it is not possible to ensure that the constraints are satisfied for any controller. In light of these challenges, the minimizer of problem (18) can be approximated by
where a user-defined level of confidence enters, and we have assumed that the constraints are modeled independently. This formulation has been proposed in (Gelbart et al., 2014; Gramacy and Lee, 2011). The idea is to minimize the objective function in expectation while satisfying the constraints with high probability. The same authors also propose a powerful procedure to include the information provided by the probabilistic constraints into the exploration strategy given by the acquisition function. Although their framework considers expected improvement (EI) (Mockus et al., 1978) as the acquisition function, they state that their formulation is valid for any generic acquisition function. We use their formulation herein to propose a failures-aware acquisition function, which tends to explore in regions where the constraints are satisfied with high probability.
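As an illustration of this probabilistic best-guess rule, the following sketch (with hypothetical posterior quantities and an assumed confidence level `delta`) picks the candidate minimizing the objective's posterior mean among points whose joint probability of satisfying all constraints exceeds the confidence level:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def constrained_best_guess(Xs, mu_f, mu_g, sd_g, c_hat, delta=0.05):
    # probability that each constraint stays below its (estimated) threshold,
    # assuming independently modeled constraints
    p_sat = np.ones(len(Xs))
    for m, s, c in zip(mu_g, sd_g, c_hat):
        p_sat *= np.array([norm_cdf((c - mi) / si) for mi, si in zip(m, s)])
    feasible = p_sat >= 1.0 - delta
    if not feasible.any():
        return Xs[np.argmax(p_sat)]      # no safe candidate: most promising one
    idx = np.where(feasible)[0]
    return Xs[idx[np.argmin(mu_f[idx])]]

# toy posterior: objective mean minimized at 0.8, one constraint growing with x
Xs = np.linspace(0.0, 1.0, 101)
mu_f = (Xs - 0.8) ** 2
x_hat = constrained_best_guess(Xs, mu_f, [Xs], [np.full(101, 0.01)], [0.5])
```

In the toy example, the unconstrained minimizer (0.8) is likely infeasible, so the best guess is pushed to the boundary of the high-probability feasible set.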
The acquisition function (20) has two modes of operation: (a) if there exists at least one controller for which all constraints are satisfied with high probability, we downsize the acquisition function with the probability of constraint satisfaction (i.e., points that are more likely to satisfy the constraints get a higher weight); (b) if the constraints are violated with high probability everywhere, whatever point is suggested by the unconstrained acquisition will violate the constraints as well. In this case, it is better to exclude the unconstrained criterion and select the point that has the highest probability of satisfying the constraints (Griffiths and Hernández-Lobato, 2017). The latter is merely an exploratory strategy that seeks a region where the constraints are satisfied. Once such a region is found, the strategy switches back to (a).
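The two modes can be captured by a simple switching rule; a sketch, where `alpha` is any unconstrained acquisition evaluated on a set of candidates and `p_sat` the corresponding probabilities of constraint satisfaction:

```python
import numpy as np

def failures_aware_acq(alpha, p_sat, delta=0.05):
    # mode (a): some candidate satisfies the constraints with high probability,
    # so weight the unconstrained acquisition by the success probability
    if np.any(p_sat >= 1.0 - delta):
        return alpha * p_sat
    # mode (b): every candidate is likely unsafe, so ignore alpha and explore
    # purely by the probability of constraint satisfaction
    return p_sat

alpha = np.array([1.0, 2.0, 3.0])
safe = failures_aware_acq(alpha, np.array([0.99, 0.50, 0.10]))
unsafe = failures_aware_acq(alpha, np.array([0.90, 0.50, 0.10]))
```

In the first call one candidate is safe enough, so the weighted acquisition decides; in the second, no candidate reaches the confidence level and the rule falls back to pure safe exploration.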
An important difference of (20) with respect to (Gelbart et al., 2014) is that we assume the true constraint thresholds to be unknown. Instead, we consider the estimated thresholds, available through the GPCR model (see Sec. 3.4). In fact, when modeling with GPCR, the probability of constraint satisfaction is a natural consequence of the model and can be computed following (17).
Fig. 2(d) and 2(c) show operation mode (a) of the acquisition function in light of the constraint. In both scenarios, the probability of constraint satisfaction, which multiplies the unconstrained acquisition function, is also shown for clarity. In both cases it is clear that downsizing by the probability of constraint satisfaction shifts the decision making toward areas where the constraint is less likely to be violated.
5.2 Bayesian optimization with unknown binary constraints
Herein, we consider the case in which all the constraints from (18) are binary. As detailed in Sec. 4.2, we propose to absorb them into the GPCR model of the objective. This makes the model push up the probability mass of the cost in regions predicted to be unstable. Then, those regions automatically look less appealing to the acquisition function of standard BO algorithms. Therefore, it becomes sufficient to deploy standard BO instead of BOC, although problem (18) is a constrained optimization problem. This is the main benefit of using the GPCR model in the presence of only binary constraints.
Interestingly, one could nevertheless use the BOC strategy (20) in this scenario by considering the stability of the objective as the only probabilistic constraint. However, in our experiments such a constraint proved redundant, since regions where the probability of stability is small are rarely explored anyway.
5.3 Min-Value Entropy Search with Unknown Level-Set Constraints (mESCO)
Max-Value Entropy Search (Wang and Jegelka, 2017) is a Bayesian optimizer that addresses problem (2) from an information-theoretic perspective. Whereas (Wang and Jegelka, 2017) consider a maximization problem in their formulation, we characterize it from the point of view of a minimization problem (mES), and extend it to account for unknown level-set constraints. The resulting algorithm is called min-Value Entropy Search with Unknown Level-Set Constraints (mESCO). In the following, we first introduce mES and then explain the pertinent modifications that lead to mESCO.
The acquisition function proposed by mES is defined as the gain in mutual information between the next point to query and the minimum,
$$\alpha(x) = \frac{1}{K} \sum_{f_* \in \mathcal{F}_*} \left[ \frac{\gamma_{f_*}(x)\,\phi(\gamma_{f_*}(x))}{2\,\Phi(\gamma_{f_*}(x))} - \log \Phi(\gamma_{f_*}(x)) \right],$$
where $\Phi$ is the cumulative density function and $\phi$ the density function of a standard normal distribution, and $\gamma_{f_*}(x) = (\mu(x) - f_*)/\sigma(x)$. The distribution $p(f(x) \mid \mathcal{D}, f_*)$ is a truncated Gaussian with support $[f_*, \infty)$. The acquisition function reveals areas where there is significant variance falling below the min-value samples $f_*$. Acquiring such samples is challenging because the distribution of the minimum is unknown a priori. To this end, (Wang and Jegelka, 2017) propose two different approaches to obtain samples of the minimum, which involve numerical approximations that will be briefly explained in the following, while explaining mESCO, for convenience.
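For concreteness, the per-sample terms of the mES criterion can be evaluated as follows. This is a sketch assuming the minimization counterpart of the Wang and Jegelka (2017) formula, with gamma = (mu - f_star) / sigma; the posterior values and min-value sample below are synthetic:

```python
import numpy as np
from math import erf, sqrt, pi

def pdf(z):
    # standard normal density
    return np.exp(-0.5 * z * z) / sqrt(2.0 * pi)

def cdf(z):
    # standard normal CDF, vectorized via the error function
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

def mes_acquisition(mu, sd, min_samples):
    # average the information-gain terms over the min-value samples f*
    alpha = np.zeros_like(mu)
    for f_star in min_samples:
        gamma = (mu - f_star) / sd
        c = np.clip(cdf(gamma), 1e-12, 1.0)  # guard against log(0)
        alpha += gamma * pdf(gamma) / (2.0 * c) - np.log(c)
    return alpha / len(min_samples)

# two candidates with equal uncertainty: the one with lower posterior mean
# has more probability mass falling below the sampled minimum
alpha = mes_acquisition(np.array([0.0, 1.0]), np.array([1.0, 1.0]), [-1.0])
```

As expected, the candidate whose posterior mean is closer to the sampled minimum receives the larger acquisition value.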
Extending mES to account for unknown constraints requires implementing some modifications to the original algorithm, which we detail next.
When addressing the constrained optimization problem (19), mES cannot be used directly because: (i) it gathers samples of the unconstrained minimum, while we are interested in samples of the constrained minimum, and (ii) using the samples directly in (21) makes the acquisition function exhibit extremely large peaks when a sample value is near the values of the observations. For these reasons, we modify the original strategy by applying the following steps:
Since the true constraint thresholds are unknown, we update the current estimates, given by the GPCR model of each constraint, as indicated in Sec. 3.4.2.
In (Wang and Jegelka, 2017), it is proposed to either approximate the distribution of the minimum using a Gumbel distribution, which involves discretizing the input domain, or to approximate random function realizations of the GP posterior and minimize them using local optimization. The approximation in the latter approach comes from applying Bochner's theorem, which provides a callable object but, as a drawback, creates undesired harmonics in the approximated realization that can mislead the search. Additionally, it requires computing the spectral density of the kernel, which might not always be available (see (Hernández-Lobato et al., 2014; Wang and Jegelka, 2017) for details).
While (Wang and Jegelka, 2017) sample the unconstrained minimum, we need to sample from the distribution over the constrained minimum. For this, we follow a method that does not require numerical approximations. Herein, we describe how to obtain one sample by running local optimization with random restarts on problem (18), using the approximated thresholds and virtual evaluations. Every time the local optimizer requests an observation of the objective or a constraint at a location, we
sample such an observation from the predictive distribution of the corresponding GP, whose moments are given in (14),
contaminate it with Gaussian noise, and
include the sampled point in a virtual dataset, which is initialized with the current set of evaluations and expanded as virtual evaluations are collected.
This operation is repeated for every new point the local optimizer requests on each of the functions (objective and constraints). Collecting virtual evaluations in this way is equivalent to collecting them from a pre-sampled realization of the GP, with the advantage that it does not need numerical approximations. The caveat of this approach is that the prior kernel matrix from (9) needs to be augmented and inverted every time a new virtual evaluation is added, which has a cost cubic in the number of evaluations allowed per random restart. To alleviate this problem, we expand the inverse of the prior covariance using the Woodbury identity (Rasmussen and Williams, 2006, A.3), which reduces the cost to quadratic.
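The incremental inverse can be sketched with the block-matrix inversion identity: given the inverse of the current n-by-n kernel matrix, the (n+1)-by-(n+1) inverse follows in O(n^2) rather than O(n^3). The kernel and points below are illustrative:

```python
import numpy as np

def grow_inverse(Kinv, k, kappa):
    # Kinv: inverse of the current n x n kernel matrix
    # k: cross-covariances with the new point; kappa: its prior variance
    v = Kinv @ k
    s = kappa - k @ v                    # scalar Schur complement
    n = Kinv.shape[0]
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = Kinv + np.outer(v, v) / s
    out[:n, n] = -v / s
    out[n, :n] = -v / s
    out[n, n] = 1.0 / s
    return out

# demo: augment a 3x3 kernel inverse with a fourth point and compare with
# the direct inverse of the full 4x4 matrix
X = np.array([0.1, 0.4, 0.7, 0.9])
K = np.exp(-0.5 * np.subtract.outer(X, X) ** 2 / 0.09) + 1e-6 * np.eye(4)
Kinv4 = grow_inverse(np.linalg.inv(K[:3, :3]), K[:3, 3], K[3, 3])
```

The update only touches the existing inverse through matrix-vector products, which is where the quadratic cost comes from.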
Once the local optimization has finished, we return the constrained optimum as a sample. For each random restart, we reset the virtual dataset and repeat this operation until the desired number of samples is collected.
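The steps above can be sketched as follows. For brevity, a hypothetical random-query loop stands in for the local optimizer, a single objective (no constraints) is shown, and the full inverse is recomputed at each query instead of being expanded incrementally:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # squared-exponential kernel for 1-D inputs
    d = np.subtract.outer(a, b)
    return np.exp(-0.5 * (d / ls) ** 2)

def posterior(Xd, yd, x, noise=1e-6):
    # GP predictive moments at a single query point x
    K = rbf(Xd, Xd) + noise * np.eye(len(Xd))
    k = rbf(np.array([x]), Xd)[0]
    Kinv = np.linalg.inv(K)
    mu = k @ Kinv @ yd
    var = max(1.0 - k @ Kinv @ k, 1e-12)
    return mu, np.sqrt(var)

def sample_min(X, y, n_steps=30, seed=0, noise_std=1e-3):
    # draw one approximate sample of the minimum via virtual evaluations:
    # each requested point is sampled from the current posterior, contaminated
    # with noise, and appended to a virtual dataset initialized with (X, y)
    rng = np.random.default_rng(seed)
    Xv, yv = X.copy(), y.copy()          # virtual dataset, reset per restart
    best = yv.min()
    for _ in range(n_steps):
        x = rng.uniform(0.0, 1.0)        # stand-in for a local-optimizer query
        mu, sd = posterior(Xv, yv, x)
        f = rng.normal(mu, sd) + rng.normal(0.0, noise_std)
        Xv, yv = np.append(Xv, x), np.append(yv, f)
        best = min(best, f)
    return best

s = sample_min(np.array([0.2, 0.8]), np.array([0.5, 0.2]))
```

Because each virtual evaluation conditions on all previous ones, the sequence is consistent with a single underlying GP realization, which is the property the text relies on.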
We modify (21) to include these samples, by replacing the samples of the unconstrained minimum with those of the constrained minimum.
We drive the acquisition function toward regions that satisfy the constraints with high probability following the strategy (20), where the unconstrained acquisition is replaced by the modified one. As a side effect, the peaks originated by samples lying near existing evaluations are effectively downsized by the probability of success, which is low in unsafe areas, where the peaks typically occur. Thus, in practice, they do not affect the decision-making.
The resulting acquisition function is smooth and can be maximized using analytical gradients.
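A common way to exploit this smoothness is multi-start gradient ascent. The following sketch maximizes a toy smooth acquisition with its analytical gradient; the toy function, step size, and restart scheme are illustrative, not the paper's:

```python
import numpy as np

def acq(x):
    # smooth toy acquisition with two modes (global maximum near x = 0.3)
    return np.exp(-(x - 0.3) ** 2 / 0.02) + 0.5 * np.exp(-(x - 0.8) ** 2 / 0.02)

def acq_grad(x):
    # analytical gradient of the toy acquisition
    return (-(x - 0.3) / 0.01 * np.exp(-(x - 0.3) ** 2 / 0.02)
            - 0.5 * (x - 0.8) / 0.01 * np.exp(-(x - 0.8) ** 2 / 0.02))

def maximize(n_restarts=10, lr=2e-3, steps=400, seed=0):
    rng = np.random.default_rng(seed)
    # random restarts plus one fixed start, to hedge against unlucky draws
    starts = np.append(rng.uniform(0.0, 1.0, n_restarts), 0.2)
    best_x, best_v = None, -np.inf
    for x in starts:
        for _ in range(steps):           # plain gradient ascent, clipped to [0, 1]
            x = np.clip(x + lr * acq_grad(x), 0.0, 1.0)
        if acq(x) > best_v:
            best_x, best_v = x, acq(x)
    return best_x

x_star = maximize()
```

Restarts are what recover the global maximizer when the acquisition is multimodal, as it typically is after conditioning on several evaluations.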
A pseudocode for mESCO is presented in Algorithm 1, and additional implementation details are given in Appendix C. (Our Python implementation of both GPCR and mESCO will be made available upon publication.) An intuitive idea of the decision-making of mESCO is visually illustrated in the one-dimensional synthetic examples (Fig. 2(d) and 2(c)). A more thorough analysis on different benchmarks and dimensionalities is provided in Sec. 7.
It is worth remarking that the proposed algorithm is not tied to the use of GPCR. For example, if the constraints are not level-set constraints (i.e., unknown thresholds), but standard constraints (i.e., known thresholds), then all the GPCR models can be replaced by standard GPs, and Algorithm 1 can equally be used by simply skipping the update steps from lines 3 and 4, and using the true constraint thresholds instead.
6 Related work and contributions
There has been recent interest in Bayesian optimization (BO) (Shahriari et al., 2016) for learning robot controllers (von Rohr et al., 2018; Rai et al., 2017; Marco et al., 2016; Calandra et al., 2016; Berkenkamp et al., 2016). While in these works unstable evaluations have been included in the dataset during learning in the form of a high penalty, in this paper this issue is fundamentally addressed using GPCR. In the following, we show close connections between the core ideas of this work and other methodologies similar to or derived from BO.
A similar methodology to BO is active learning, where the next controller is selected using an acquisition criterion. For example, in (Schreiter et al., 2015) the optimization problem (18) is approached using safe exploration. More specifically, a negative label is assigned to failure cases, and a positive label is assigned to success cases. A Gaussian process classifier (GPC) is used to learn a probabilistic binary constraint that allows for making probabilistic statements about whether a region is likely to be stable or unstable. While this method appears simple, it involves carrying two models: one Gaussian process regression model (GPR) for the objective, and one GPC to model the constraint. The authors show that the input space is safely explored, while the unknown constraint is learned alongside. Observations of such constraint are modeled with a heterogeneous likelihood that perceives (i) continuous-valued observations near the stability boundary, and (ii) discrete labels (unstable/stable) far from it. While this approach shares similarities with the one presented herein, we propose a Bayesian model that explicitly bypasses the need for an extra GPC model by including the discrete labels directly in the GPR, as detailed in Sec. 2.3. Such a model is mathematically different from that of (Schreiter et al., 2015) and carries fewer hyperparameters to estimate or choose, while retaining the same flexibility. Finally, the authors of (Schreiter et al., 2015) assume that continuous-valued observations of the objective are only possible near the transition boundary, while we do not need that assumption.
Another closely related work in the context of active learning (Gotovos et al., 2013) proposes a method that solves a classification problem for function values lying above or below a specific threshold. The problem is posed as a level set estimation problem, where the threshold is either explicitly given by an expert, or implicitly defined within a confidence interval. In the latter case, the authors give convergence guarantees for such interval to decrease below a desired accuracy after a number of evaluations. While our problem formulation shares similarities with this one, there are fundamental differences: (i) their target is to classify the input space into stable vs. unstable areas, while ours is to find the optimum while avoiding unstable regions, revealed during exploration; (ii) in their formulation, the threshold does not impose a critical constraint and thus observations of the objective can be obtained above it. In our formulation, constraint violation is critical in the sense that no observations of the objective are obtained above the threshold, but only a label indicating that the evaluation was unstable.
In the context of RL for robotics, BO has also been combined with policy search methods to learn feedback policies (Englert and Toussaint, 2018; Drieß et al., 2017). Therein, the unknown constraint is modeled with a GPC, which is then used to compute a safe boundary region in the input domain. They include the information of the boundary directly in the Bayesian optimization criterion, which then trades off exploration near the current boundary estimate with exploitation in a region inside the boundary estimate. While such a method is relevant in the context of safe learning, we observe two main aspects: (i) it needs a stable policy as a starting point, and (ii) it expands the safe region around this initial controller, leaving unexplored potentially better optima that could exist outside of it. Instead, our method does not assume a single safe region that gets expanded, but can identify multiple safe regions for which the probability of constraint satisfaction is high. Additionally, although they propose safe exploration, their method is not exempt from failures when identifying the safety boundary on a real robot. When failures occur, we see them as a benefit, since they are informative about regions that shall not be visited again. The interesting case in which the optimal solution lies close to the boundary can be handled by our method as well.
Bayesian optimization with unknown constraints (BOC) (Gardner et al., 2014; Snoek, 2013; Gelbart et al., 2014; Gramacy and Lee, 2011; Picheny, 2014; Schonlau et al., 1998; Schreiter et al., 2015; Griffiths and Hernández-Lobato, 2017; Hernández-Lobato et al., 2016) has shown to be effective in well-known optimization benchmarks (Gelbart, 2015; Gelbart et al., 2014, 2016). The target of the constrained optimization problem (18) is to optimize an unknown objective while satisfying multiple unknown constraints with high probability. When it comes to determining constraint satisfaction, there are two main differences between the aforementioned BOC techniques and our approach. First, while in BOC the constraint threshold is typically assumed to be known a priori (in those papers, the constrained optimization problem is generally formulated with constraints of the form g(x) >= 0, which implicitly assumes that the threshold, zero, is known; our formulation (18) is equivalent up to a shift of each constraint by its threshold), we do not assume so. In our work, we explicitly emphasize a generic unknown threshold, which does not need to be defined a priori but can be learned from data, either estimated as a model parameter or treated from a Bayesian perspective. This aspect is a core contribution of this work, novel in the context of BOC, and not explored by the aforementioned papers. Second, a prevailing assumption in BOC is that noisy observations of the constraints are always available, even after constraint violation, while in this work we assume that no continuous-valued observation is obtained upon constraint violation, as stated in Sec. 4. Instead, we assign to those measurements a negative label, which is captured by the proposed Bayesian model and allows us to make inference and improve the knowledge about the unknown threshold.
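The equivalence sketched above can be written out as follows (a reconstruction with generic symbols, since the original notation is not shown here):

```latex
% Typical BOC formulation (threshold implicitly known to be zero):
\min_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad g_j(x) \geq 0, \;\; j = 1, \dots, G
% Level-set formulation with (possibly unknown) thresholds c_j:
\min_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad \bar{g}_j(x) \leq c_j
% The two coincide under the change of variables
g_j(x) \coloneqq c_j - \bar{g}_j(x),
% but the first form presumes the thresholds c_j are known and already subtracted.
```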
To summarize, in this work we propose a general Bayesian model that can be used to model objective and constraints in BO when unstable evaluations yield no continuous-valued observations. In addition, we extend an existing BO method to account for unknown constraints. In the context of the related work, a detailed list of the contributions is summarized below.
One model for both classification and regression: The proposed GPCR model bypasses the need to train an additional GP classifier, as done in (Schreiter et al., 2015; Snoek, 2013; Englert and Toussaint, 2018; Drieß et al., 2017), resulting in an equally flexible model with fewer parameters.
No failure penalties, and no constraint thresholds: The proposed Bayesian model circumvents the need to define an arbitrary penalty for failures a priori, as commonly done in RL and BO. Instead, whenever a roll-out becomes unstable, the model pushes the predictive probability mass above the unknown stability threshold. In that way, failure points and their vicinity have lower chances of being revisited during the optimization search. The unknown stability threshold is a key parameter of the model and is re-learned from data at each iteration, instead of being heuristically chosen by a designer.
When additional indicators of failure are accessible, they can be included in the problem as additional constraints. Such constraints can then be modeled using the same aforementioned model, for which the constraint threshold is unknown but can also be learned from data.
To the best of our knowledge, GPCR is the first Bayesian model able to estimate the threshold from a hybrid data set that combines discrete labels and observations of the objective, in the context of GP regression and Bayesian optimization.
Min-Value Entropy Search with unknown level-set constraints (mESCO): Besides the GPCR model, we propose mESCO, a novel extension of a recent entropy-based Bayesian optimizer, Max-Value Entropy Search (Wang and Jegelka, 2017), to account for unknown constraints. This method is used to validate GPCR, first on several synthetic benchmarks, and second on a humanoid robot balancing an inverted pole as an experimental platform.
7 Experimental results
In this section, we validate GPCR in the context of black-box constrained optimization. We assess the capability of the model to learn the objective and the constraints when their corresponding thresholds are unknown. At the same time, we test the performance of mESCO in finding the constrained global minimum. To this end, we present three different settings. First, in Sec. 7.1, we illustrate the benefits and caveats of GPCR in finding the global minimum in two simulated optimization problems. Second, in Sec. 7.2, we report on the consistency of GPCR and our implementations of mES and mESCO by benchmarking the same simulated scenarios. Last, in Sec. 7.3, we show the benefit of GPCR and mESCO in finding the controller parameters of a real robot balancing an inverted pole.
7.1 Illustrative examples
In this section, GPCR is demonstrated in two different simulated scenarios in 2D. The first simulation illustrates the usability of GPCR when the objective is self-constrained (cf. Sec. 4.1), and the second simulation shows the case of one unconstrained objective and one level-set constraint (cf. Sec. 4.3).
7.1.1 Experiment design choices
For both simulations, the domain of the objective functions and the constraints is scaled to the unit square. We consider an isotropic Matérn 3/2 kernel, with all lengthscales and variances fixed a priori. In both cases, we assume a Gaussian hyperprior over the instability threshold in Simulation I and over the constraint threshold in Simulation II. These thresholds are re-estimated after each iteration using MAP, as explained in Sec. 3.4.2. The user confidence level is fixed for all the experiments. The initial point was randomly sampled from a uniform distribution in all cases. For all evaluations, we consider small evaluation noise.
7.1.2 Simulation I
We consider the 2D objective function
proposed in (Gardner et al., 2014) as a benchmark for constrained Bayesian optimization. While they consider the objective together with an external constraint, we instead consider the objective to be self-constrained (cf. Sec. 4.1) with an instability threshold. This induces a region of unstable controllers, depicted in Fig. 3(a) as a shadowed area. We set a wide hyperprior on the threshold and run mES for 30 iterations. Fig. 3(c) and 3(b) show that mES finds the optimal value after 17 stable and 13 unstable evaluations. The last 12 evaluations are stable and targeted at the area around the global minimum. At the final iteration, the estimated best guess lies close to the true global minimum, and the instability threshold is estimated close to its true value. While in (Gardner et al., 2014) an algorithm for constrained Bayesian optimization is proposed, a fair comparison between our framework and theirs is not possible because we define the problem differently (i.e., we consider constraint and objective identical). However, we can highlight two clear benefits of using GPCR. First, the GPCR model on the objective absorbs the failures, while in (Gardner et al., 2014) any black-box constraint needs to be modeled with an additional standard GP. Furthermore, we can search for the global minimum using an unconstrained Bayesian optimization method, such as mES, without the need for a constrained Bayesian optimizer (like the one proposed in (Gardner et al., 2014), or mESCO).
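For reference, the 2D benchmark of Gardner et al. (2014) is commonly stated as f(x1, x2) = cos(2 x1) cos(x2) + sin(x1) on [0, 6]^2; the sketch below labels roll-outs with an illustrative instability threshold (both the exact formula and the threshold value are assumptions here, not taken from this paper's setup):

```python
import numpy as np

def gardner_objective(x1, x2):
    # 2-D benchmark objective, as commonly stated for Gardner et al. (2014)
    return np.cos(2.0 * x1) * np.cos(x2) + np.sin(x1)

# self-constrained objective: a roll-out is unstable wherever f(x) exceeds
# an illustrative threshold c_true (the true threshold is unknown to GPCR)
c_true = 0.5
g1, g2 = np.meshgrid(np.linspace(0.0, 6.0, 60), np.linspace(0.0, 6.0, 60))
f = gardner_objective(g1, g2)
frac_stable = (f <= c_true).mean()
```

Evaluations in the unstable fraction of the domain would return only a discrete failure label, which is exactly what the GPCR model absorbs.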
Figure 3: (a) shows the objective, the region of unstable controllers (shadowed area), and the true feasible global minimum (blue star). (b) and (c) show the predictive mean and standard deviation of GPCR, respectively, with stable evaluations (red dots), unstable evaluations (orange crosses), and the best-guess estimate of the global minimum (magenta triangle). (d) shows the mES acquisition function and the next suggestion (red dot). Subsequent figures in the paper follow the same legend.
7.1.3 Simulation II
We consider the well-known 2D Branin-Hoo benchmark, constrained to a centered circle
which has also been used in (Gelbart et al., 2014). The constraint violation region is depicted as a shadowed area in Fig. 4(e), and the objective is shown in Fig. 4(a). The objective has three minima, marked in Fig. 4(e) and 4(a), but the constraint hides two of them, leaving only one as the solution to the constrained optimization problem. The square root in the constraint function is deliberately introduced to make it undefined outside the circle, i.e., if the constraint is violated, no continuous-valued observation can be obtained, but only a discrete label. This characterizes it as a level-set constraint (cf. Sec. 4.3). We model the constraint using GPCR, while the objective is modeled with a standard GP. We have chosen a wide hyperprior on the constraint threshold. Fig. 5 shows our proposed constrained Bayesian optimization algorithm, mESCO, after 50 iterations, of which 9 were unstable and 41 were stable. At the final iteration, the estimated best guess lies close to the true constrained minimum, and the constraint threshold is estimated close to its true value. These results compare favorably with (Gelbart et al., 2014), where the estimated global minimum is obtained with a known threshold. Furthermore, the advantage of using GPCR on the constraint, as compared to a standard GP classifier (Gelbart et al., 2014), is that GPCR allows the constraint threshold to be unknown a priori and estimated over iterations.
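For reference, the Branin-Hoo objective and a circular level-set constraint can be sketched as below; the circle's center and radius are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def branin(x1, x2):
    # standard Branin-Hoo formula (global minimum value about 0.3979)
    a, b, c = 1.0, 5.1 / (4.0 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8.0 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * np.cos(x1) + s

def circle_constraint(x1, x2, center=(2.5, 7.5), radius=5.0):
    # level-set constraint: the square root is undefined outside the circle,
    # mimicking a failed roll-out that yields only a discrete label
    d2 = (x1 - center[0]) ** 2 + (x2 - center[1]) ** 2
    if d2 > radius ** 2:
        return None                      # failure: discrete label only
    return np.sqrt(radius ** 2 - d2)     # continuous-valued observation

g_in = circle_constraint(2.5, 7.5)       # inside the circle: returns a value
g_out = circle_constraint(20.0, 20.0)    # outside: failure label
```

Returning `None` outside the circle mirrors the experimental setup: a violated constraint produces no number to regress on, only the binary failure information.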
7.2 Statistical comparison
After having shown the general applicability of the proposed framework, we now report on the consistency of GPCR and our implementations of both mES and mESCO. To this end, we run the experiments from the previous section multiple times and measure the performance of the global optimization methods using the inference regret at each iteration, defined as the absolute difference between the function evaluation at the best guess at that iteration and the true optimal value (Wang and Jegelka, 2017).
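Under that definition, the regret curve is straightforward to compute; a minimal sketch, where `f` is a synthetic objective, `x_hats` the per-iteration best guesses, and `x_star` the known optimum:

```python
import numpy as np

def inference_regret(f, x_hats, x_star):
    # r_t = |f(x_hat_t) - f(x_star)|: gap between the evaluation at the best
    # guess at iteration t and the true optimal value
    f_star = f(x_star)
    return np.array([abs(f(x) - f_star) for x in x_hats])

# toy run whose best guesses approach the optimum at x = 0.3
r = inference_regret(lambda x: (x - 0.3) ** 2, [0.0, 0.25, 0.3], 0.3)
```

A decreasing regret curve, averaged over repeated runs, is the quantity plotted in the benchmarks below.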
We benchmark the experiments described in Sec. 7.1 over 20 runs for each case and report the averaged regret in Fig. 5(a) and Fig. 5(b). The results from benchmarking Simulation I are shown in the green curve of Fig. 5(a), and those from Simulation II in the green curve of Fig. 5(b), corresponding to cases 1) and 3) described in Sec. 4.1 and Sec. 4.3, respectively. The remaining cases 2) and 4) are also addressed herein, by reusing the benchmark of Simulation II with a few variants, depending on the case.
To address case 2) (Sec. 4.2), we assume the constraint is binary and that its binary information is absorbed by the GPCR model used for the objective. To address case 4) (Sec. 4.4), we assume the objective is self-constrained, with an instability threshold, and the constraint is a level-set constraint. Fig. 5(a) shows the performance of mES in case 2), and Fig. 5(b) shows the performance of mESCO in case 4). To assess the performance of mESCO, we compare against random search using uniform sampling in Fig. 5(b). As shown, mESCO achieves the lowest regret in case 3). We also see that mESCO achieves better minima, on average, than the one reported in (Gelbart et al., 2014). For all cases, the solution is approximately found with a low error.
In Fig. 5(c), we depict the evolution of the estimated thresholds over iterations for case 4). While the estimation initially shows a broad transient, it eventually exhibits convergent behavior. This shows that only a rough approximation to the true threshold is needed for GPCR to work in practice when using it for BO.
7.3 Robot experiments
The simulation study revealed the possible benefits of GPCR and the effectiveness of mESCO in synthetic scenarios. In this section, we evaluate its performance on a real robotic platform: the humanoid robot Apollo balancing an inverted pole. This is a much more challenging scenario than the aforementioned simulations because (i) the latent objective and constraint functions are unknown a priori, (ii) collecting evaluations is time-consuming, (iii) stopping an experiment because the controller is unstable or unsafe implies restarting the platform, which is also time-consuming, and (iv) the noise is not synthetically generated, but propagated from the sensors to the cost observation we obtain. This platform has been used before as a testbench for controller learning (Marco et al., 2016). The experiments presented in this section build on top of those from (Marco et al., 2016), in which the parameters of a Linear Quadratic Regulator (LQR) controller were automatically tuned in order to improve the balancing performance. Therein, an important caveat of the learning setup was committing to a fixed heuristic penalty cost, chosen ad hoc for unstable controllers prior to starting the learning experiments. As detailed in (Marco et al., 2016), during the learning run, a stable controller (with poor performance) was found that yielded a higher cost than the penalty. Because a stable controller should not be penalized more than an unstable one, we stopped the learning run, selected a higher penalty, and restarted the learning from the beginning. Since the choice of the penalty affects the decision-making of the Bayesian optimizer used therein, the data acquired with the older penalty had to be discarded for fairness.
The clear practical advantage of GPCR is that it overcomes this situation by adapting to the worst observed stable cost, keeping the probability mass of unstable areas always above it, and avoiding the need to restart the learning run and discard the data.
Additionally, in (Marco et al., 2016) it is reported that two main situations led to instability and, in consequence, to premature experiment termination: (i) the end-effector position leaving a safety region, and (ii) the end-effector acceleration reaching a maximum value. In the context of this work, those restrictions can be seen as two external safety constraints that determine the shape of the unstable region.
In this section, we show the advantages of using GPCR to take such constraints into account and bypass the need to define a heuristic cost. For this, we consider three different tuning scenarios: (i) a 2D tuning problem with two binary constraints, (ii) the same problem with one binary constraint and one level-set constraint, and (iii) a 5D tuning problem with two binary constraints.
7.3.1 Controller tuning with unknown constraints
In (Marco et al., 2016), the design parameters of the well-known Linear Quadratic Regulator (LQR) (Anderson and Moore, 1990, Sec. 2.4) are learned from data. Therein, the performance of a balancing experiment over a finite time horizon is quantified with a quadratic cost, where the state and control input trajectories are collected during a balancing experiment with given control parameters. Such parameters enter the diagonal of the control design weight matrices, used to compute the feedback gain matrix that characterizes the control strategy