1 Introduction
Reinforcement learning (RL) has seen successes in domains such as video and board games mnih2013playing; silver2016mastering; silver2017mastering, and control of simulated robots ammar2014online; schulman2015trust; schulman2017proximal. Though successful, these applications assume idealised simulators and require tens of millions of agent-environment interactions, typically performed by randomly exploring policies. In real-world safety-critical applications, however, such an idealised framework of random exploration with the ability to gather samples at ease falls short, partly due to the catastrophic costs of failure and the high operating costs. Hence, if RL algorithms are to be applied in the real world, safe agents that are sample-efficient and capable of mitigating risk need to be developed. To this end, different works adopt varying safety definitions: some are interested in safe learning, i.e., safety during the learning process, while others focus on eventually acquiring safe policies. Safe learning generally requires some form of safe initial policy as well as regularity assumptions on dynamics, all of which depend on which notion of safety is considered. For instance, safety defined by constraining trajectories to safe regions of the state-action space is studied in, e.g., akametalu2014reachability, which assumes partially-known control-affine dynamics with Lipschitz regularity conditions, as well as in koller2018learning; aswani2013provably, both of which require strong assumptions on dynamics and initial control policies to give theoretical guarantees of safety. Safety in terms of Lyapunov stability khalil2002nonlinear is studied in, e.g., chow2018lyapunov; chow2019lyapunov (model-free), which require a safe initial policy, and berkenkamp2017safe (model-based), which requires Lipschitz dynamics and a safe initial policy.
The work in wachi2018safe (which builds on turchetta2016safe) considers deterministic dynamics and attempts to optimise expected return while not violating a pre-specified safety threshold. Several papers attempt to keep expectation constraints satisfied during learning, e.g., achiam2017constrained extends ideas of kakade2002approximately, and dalal2018safe adds a safety layer that makes action corrections to maintain safety. When it comes to safe final policies (defined in terms of corresponding constraints), on the other hand, some works chow2014algorithms; prashanth2014policy consider risk-based constraints and develop model-free solvers.
Unfortunately, most of these methods are sample-inefficient and make a large number of visits to unsafe regions. Given our interest in algorithms achieving safe final policies while reducing the number of visits to unsafe regions, we pursue a safe model-based framework that assumes neither a safe initial policy nor a priori knowledge of the transition model. As we believe that sample efficiency is key for effective safe solvers, we choose Gaussian processes GPbook for our model. Of course, any such framework is hampered by the quality and assumptions of the model's hypothesis space. Aiming to alleviate these restrictions, we go further and integrate active learning (discussed in Section 2.2), wherein the agent influences where to query/sample in order to generate new data, reduce model uncertainty, and ultimately learn more efficiently. Successful application of this approach is very much predicated upon the chosen method of quantifying potential uncertainty reduction, i.e., what we refer to as the (semi-)active metric. Common (semi-)active metrics are those which identify points in the model with large entropy or large variance, or where those quantities would be most reduced in a posterior model if the given point were added to the data set krause2007nonmyopic; krause2008near; fedorov2013theory; settles2009active. However, our desire for safety adds a complication, since a safe learning algorithm will likely have greater uncertainty in regions where it is unsafe, by virtue of not exploring those regions deisenroth2011pilco; kamthe2017data. Indeed, our experiments (Figure 1) support this claim. Attacking the above challenges, we propose two novel out-of-sample (semi-)metrics for Gaussian processes that allow for exploration in novel areas while remaining close to training data, thus avoiding unsafe regions. To enable an effective and grounded introduction of active exploration and safety constraints, we define a novel constrained bi-objective formulation of RL and provide a policy multi-gradient solver that is proven effective on a variety of safety benchmarks. In short, our contributions are: 1) a novel constrained bi-objective formulation enabling exploration and safety considerations, 2) safety-aware active (semi-)metrics for exploration, and 3) a policy multi-gradient solver trading off cost minimisation, exploration maximisation, and constraint feasibility. We test our algorithm on three stochastic dynamical systems after augmenting these with safety regions and demonstrate a significant reduction in sample and cost complexities compared to the state of the art.
2 Background and notation
2.1 Reinforcement learning
We consider Markov decision processes (MDPs) with continuous state and action spaces, $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, c, \gamma \rangle$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $\mathcal{P}(s_{t+1}|s_t, a_t)$ is a transition density function, $c: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the cost function, and $\gamma \in [0, 1)$ is a discount factor. At each time step $t$, the agent is in state $s_t$ and chooses an action $a_t$, transitioning it to a successor state $s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)$ and yielding a cost $c(s_t, a_t)$. Given a state $s_t$, an action is sampled from a policy $\pi_{\theta}$, where we write $\pi_{\theta}(a_t|s_t)$ to represent the conditional density of an action. Upon subsequent interactions, the agent collects a trajectory $\tau = [s_0, a_0, \dots, s_T]$, and aims to determine an optimal policy $\pi_{\theta}^{\star}$ by minimising the total expected cost: $\min_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\sum_{t} \gamma^{t} c(s_t, a_t)\right]$, where $p_{\theta}(\tau)$ denotes the trajectory density defined as $p_{\theta}(\tau) = p(s_0) \prod_{t} \mathcal{P}(s_{t+1}|s_t, a_t)\, \pi_{\theta}(a_t|s_t)$, with $p(s_0)$ being an initial state distribution.

Constrained MDPs: The above can be generalised to include various forms of constraints, often motivated by the desire to impose some form of safety measure. Examples are expectation constraints achiam2017constrained; altman1999constrained (which have the same form as the objective, i.e., expected discounted sum of costs), constraints on the variance of the return prashanth2013actor, chance constraints (a.k.a. Value-at-Risk (VaR)) chow2017risk, and Conditional Value-at-Risk (CVaR) chow2014algorithms; chow2017risk; prashanth2014policy. The latter is the constraint we adopt in this paper, for reasons elucidated below. Adding constraints means we cannot directly apply standard algorithms such as policy gradient sutton2018reinforcement, and different techniques are required, e.g., via Lagrange multipliers bertsekas1997nonlinear, as was done in chow2014algorithms; chow2017risk; prashanth2014policy, among many others. Further, current methods only consider cost minimisation with no regard to exploration, as we do in this paper.
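As a concrete illustration of the objective above, the total expected cost is typically estimated by Monte-Carlo averaging of discounted cost sums over sampled trajectories. The following minimal sketch uses function names of our own choosing, not from the paper:

```python
import numpy as np

def discounted_cost(costs, gamma=0.99):
    """Total discounted cost sum_t gamma^t * c_t of a single trajectory."""
    return sum(c * gamma**t for t, c in enumerate(costs))

def expected_cost(trajectory_costs, gamma=0.99):
    """Monte-Carlo estimate of the policy objective from sampled trajectories."""
    return np.mean([discounted_cost(c, gamma) for c in trajectory_costs])
```

In practice, the per-step costs come from rollouts of the policy in the (learnt or real) environment; the policy parameters are then updated against this estimate.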
Model-Based Reinforcement Learning: Current solutions to the problem described above (constrained or unconstrained) can be split into model-free and model-based methods. Though effective, model-free algorithms are highly sample-inefficient hessel2018rainbow. For sample-efficient solvers, we follow model-based strategies that we now detail. To reduce the number of interactions with the real environment, model-based solvers build a surrogate model, $\hat{\mathcal{P}}$, to determine optimal policies. These methods typically run two main loops. The first gathers traces from the real environment to update $\hat{\mathcal{P}}$, while the second improves the policy using $\hat{\mathcal{P}}$ deisenroth2011pilco; hafner2019dream. Among various candidate models, e.g., world models ha2018world, in this paper we follow PILCO deisenroth2011pilco and adopt Gaussian processes (GPs), as we believe that uncertainty quantification and sample efficiency are key for real-world considerations of safety. In this construction, one places a Gaussian process prior on a latent function $f$ to map between input-output pairs. Such a prior is fully specified by a mean function, $m(\cdot)$, and a covariance function $k(\cdot, \cdot)$ GPbook. We write $f \sim \mathcal{GP}(m, k)$ to emphasise that $f$ is sampled from a GP GPbook. Given a data set of $n$ input-output pairs $\{(\tilde{\boldsymbol{x}}_i, \boldsymbol{y}_i)\}_{i=1}^{n}$, corresponding, respectively, to state-action and successor-state tuples, one can perform predictions on a query set of $n_{\star}$ test data points. Such a distribution is Gaussian with predictive mean vector and covariance matrix given by $\boldsymbol{\mu}_{\star} = \boldsymbol{K}_{\star n}\left(\boldsymbol{K}_{nn} + \sigma^{2}\boldsymbol{I}\right)^{-1}\boldsymbol{y}$ and $\boldsymbol{\Sigma}_{\star} = \boldsymbol{K}_{\star\star} - \boldsymbol{K}_{\star n}\left(\boldsymbol{K}_{nn} + \sigma^{2}\boldsymbol{I}\right)^{-1}\boldsymbol{K}_{n\star}$, where $\sigma^{2}$ is the noise variance, assumed Gaussian. In the above, $\boldsymbol{y}$ denotes a vector concatenating all training labels, and $\boldsymbol{K}_{\star n}$, $\boldsymbol{K}_{nn}$, and $\boldsymbol{K}_{\star\star}$ are kernel matrices evaluated between test and training feature matrices, of sizes $n_{\star} \times n$, $n \times n$, and $n_{\star} \times n_{\star}$, respectively. We executed training in GPyTorch gardner2018gpytorch, and used multi-output GPs as defined in wolff2020mogptk.

2.2 Active learning in dynamical systems
In active learning fedorov2013theory; settles2009active, an agent chooses points to sample/query that best improve learning or model updates. This is often performed by optimising an acquisition function, which quantifies how much a model would improve if a given data point were queried, e.g., points where the model has high entropy or where variance can be most reduced. Active learning with GPs has been studied in the static case, where points can be selected at will (see, e.g., krause2007nonmyopic; krause2008near). In the context of dynamical systems, however, added complications arise, as one is not always able to directly drive the system into a desired state. Recent work has attempted to resolve this problem: in buisson2019actively and schultheis2019receding, receding-horizon optimisation is used to iteratively update a model; in buisson2019actively, actions are favoured that maximise the sum of differential entropy terms at each point in the mean trajectory predicted to occur under those actions; and in schultheis2019receding, a sum of variance terms is optimised to improve Bayesian linear regression. Again, for computational tractability, the predicted mean of states is used, as propagating state distributions through the model is difficult. Different to our paper, neither of these works deals with safety, nor do they have additional objectives to maximise/minimise, avoiding a bi-objective formulation. In jain2018learning, a GP model used for MPC is updated by greedily selecting points that maximise information gain, i.e., reduction in entropy, as is done in krause2008near. Only very recently, the authors in ball2020ready proposed an active learning approach coupled with MBRL. Similar to SAMBA, they use an adaptive convex combination of objectives; however, their exploration metric is based on reward variance computed from a (finite) collection of models, increasing the burden on practitioners who now need to predefine the collection of dynamics. They do not use GPs as we do, and do not consider safety. Compared to ball2020ready, we believe SAMBA is more flexible, supporting model learning from scratch and enabling principled exploration coupled with safety considerations. Further afield from our work, active learning has recently been studied in the context of GP time-series in zimmer2018safe, and for pure exploration in shyam2018model, which uses a finite collection of models. Our (semi-)metrics generalise the above to consider safe regions and future information trade-offs, as we detail in Section 3.

3 SAMBA: Framework & solution
In designing SAMBA, we take PILCO deisenroth2011pilco as a template and introduce two novel ingredients allowing for active exploration and safety. Following PILCO, SAMBA runs a main loop that gathers traces from the real environment and updates a surrogate model, $\hat{\mathcal{P}}$, encoded by a Gaussian process. Given $\hat{\mathcal{P}}$, PILCO and other model-based methods srinivas2020curl attempt to obtain a policy that minimises total expected cost with respect to traces sampled from the learnt model. The updated policy is then used to sample new traces from the real system, where the above process repeats. During this sampling process, model-based algorithms consider various metrics in acquiring transitions that reveal novel information, which can be used to improve the surrogate model's performance. PILCO, for instance, makes use of the GP uncertainty, while ensemble models saphal2020seerl; van2020simple explore via their aggregated uncertainties. With sufficient exploration, this allows policies obtained from surrogate models to control real systems. Our safety considerations mean we would prefer agents that learn well-behaving policies with minimal sampling from unsafe regions of state-action spaces; a property we achieve by incorporating CVaR constraints, as we detail in Section 3.1. Requiring a reduced number of visits to unsafe regions, hence, lessens the amount of "unsafe" data gathered in such areas by definition. Therefore, model entropy is naturally increased in these territories, and algorithms following such exploration strategies are, as a result, encouraged to sample from hazardous states. As such, a naive adaptation of entropy-based exploration can quickly become problematic by contradicting safety requirements. To circumvent these problems, we introduce two new active exploration (semi-)metrics in Section 3.2 that assess information beyond training-data availability and consider input-output data distributions. Our (semi-)metrics operate under the assumption that during any model-update step, "safe" transition data (i.e., a set of state-action-successor-state triplets sampled from safe regions) is more abundant than "unsafe" triplets. Given such a skew between distributions, our (semi-)metrics yield increased values on test queries close to high-density training data. With such (semi-)metrics, we enable novel model-based algorithms that solve a bi-objective optimisation problem attempting to minimise cost while maximising active values. In other words, during this step, we not only update policies to be well-behaving in terms of total cost, but also to actively explore safe transitions that allow for improved models in successor iterations.

Of course, the assumption of skew towards safe regions in the training data distribution does not hold in general, since solving the above only ensures good expected returns. To frequently sample safe regions, we augment cost minimisation with a safety constraint, encoded through the CVaR of a user-defined safety cost function with respect to model traces. Hence, SAMBA solves a bi-objective constrained optimisation problem (§3.1) aimed at minimising cost, maximising active exploration, and meeting safety constraints.
3.1 Bi-objective constrained MDPs
Given a (GP) model of an environment, we formalise our problem as a generalisation of constrained MDPs to support bi-objective losses. We define a bi-objective MDP by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, c, l, \mathcal{I}, \gamma \rangle$ consisting of state space $\mathcal{S}$, action space $\mathcal{A}$, Gaussian process transition model $\mathcal{P}$, cost function $c$, constraint cost function (used to encode safety) $l$, additional objective function $\mathcal{I}$, and discount factor $\gamma$.^1 In our setting, for instance, $l$ encodes the state-action's risk by measuring the distance to an unsafe region, while $\mathcal{I}$ denotes an active (semi-)metric from those described in Section 3.2. To finalise the problem definition of bi-objective MDPs, we need an approachable constraint to describe safety considerations. In incorporating such constraints, we are chiefly interested in those that are flexible (i.e., can support different user-designed safety criteria) and allow us to quantify events occurring in the tails of cost distributions. Surveying single-objective constrained MDPs, we observe that the literature predominantly focuses on expectation-type constraints achiam2017constrained; raybenchmarking, a not-so-flexible approach restricted to being safe on average. Others, however, make use of conditional value-at-risk (CVaR), a coherent risk measure chow2017risk that provides powerful and flexible notions of safety (i.e., can support expectation, tail-distribution, or hard (unsafe-state-visitation) constraints) and quantifies tail risk in the worst $(1-\alpha)$ quantile. Formally, given a random variable $Z$, $\text{CVaR}_{\alpha}(Z)$ is defined as $\text{CVaR}_{\alpha}(Z) = \min_{\nu \in \mathbb{R}} \left\{ \nu + \frac{1}{1-\alpha} \mathbb{E}\left[(Z - \nu)^{+}\right] \right\}$, where $(x)^{+} = \max(x, 0)$. With such a constraint, writing $\mathcal{C}(\tau) = \sum_{t} \gamma^{t} c(s_t, a_t)$, we can state the optimisation problem of our bi-objective MDP as:

$$\min_{\theta} \; \left[\, \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\mathcal{C}(\tau)\right], \; -\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\mathcal{I}(\tau)\right] \,\right] \quad \text{s.t.} \quad \text{CVaR}_{\alpha}\left(\mathcal{L}(\tau)\right) \leq \xi, \quad (1)$$

with $\mathcal{L}(\tau) = \sum_{t} \gamma^{t} l(s_t, a_t)$ being the total accumulated safety cost along $\tau$, and $\xi$ a safety threshold. Of course, Equation 1 describes a problem not standard to reinforcement learning. In Section 3.3, we devise policy multi-gradient updates to determine $\pi_{\theta}^{\star}$.

^1 It is worth noting that the (semi-)metrics we develop in Section 3.2 map to $\mathbb{R}_{\geq 0}$ instead of $\mathbb{R}$.
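Empirically, the CVaR of the safety-cost return at level $\alpha$ is the mean of the worst $(1-\alpha)$ fraction of sampled returns. A minimal sketch of this estimator (function names are ours, not from the paper):

```python
import numpy as np

def cvar(samples, alpha=0.95):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of costs."""
    z = np.sort(np.asarray(samples, dtype=float))
    var = np.quantile(z, alpha)   # Value-at-Risk (the alpha-quantile)
    tail = z[z >= var]            # losses at or beyond the quantile
    return tail.mean()
```

A constraint such as the one in Equation 1 is then checked by comparing this tail average against the threshold, rather than the mean of all samples.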
3.2 $\mathcal{I}$ functions for safe active exploration
In general, $\mathcal{I}$ can be any bounded objective that needs to be maximised/minimised in addition to the standard cost. Here, we choose one that enables active exploration in safe state-action regions. To construct $\mathcal{I}$, we note that a feasible policy of the problem in Equation 1, i.e., one abiding by the CVaR constraint, samples tuples that mostly reside in safe regions. As such, the training data distribution is skewed in the sense that safe state-action pairs are more abundant than unsafe ones. Exploiting such skewness, we can indirectly encourage agents to sample safe transitions by maximising information-gain (semi-)metrics that only grow in areas close enough to the training data.
$\mathcal{I}_{\text{LOO}}$: Leave-one-out semi-metric
Consider a GP dynamics model trained on a state-action-successor-state data set $\mathcal{D}_n = \{(\tilde{\boldsymbol{x}}_i, \boldsymbol{y}_i)\}_{i=1}^{n}$.^2 Such a GP induces a posterior allowing us to query predictions on test points $\boldsymbol{x}_{\star}$. As noted in Section 2.1, the posterior is also Gaussian, with predictive mean and variance computed from the training data.^3 Our goal is to design a measure that increases in regions with dense training data (due to the usage of the CVaR constraint) to aid agents in exploring novel yet safe tuples. To that end, we propose using an expected leave-one-out semi-metric between two Gaussian processes, defined for one query data point $\boldsymbol{x}_{\star}$ as: $\mathcal{I}_{\text{LOO}}(\boldsymbol{x}_{\star}) = \frac{1}{n} \sum_{i=1}^{n} \text{KL}\left( p\left(f(\boldsymbol{x}_{\star}) \mid \mathcal{D}_{n}\right) \,\big\|\, p\left(f(\boldsymbol{x}_{\star}) \mid \mathcal{D}_{n \setminus i}\right) \right)$, with $\mathcal{D}_{n \setminus i}$ being $\mathcal{D}_{n}$ with point $i$ left out. Importantly, such a measure will only grow in regions close enough to sampled training data, as the posterior mean and covariance shift by factors that scale linearly and quadratically, respectively, with the total covariance between $\boldsymbol{x}_{\star}$ and $\tilde{\boldsymbol{X}}_{\setminus i}$, where $\tilde{\boldsymbol{X}}_{\setminus i}$ denotes the feature matrix with the $i$-th row removed.^4 In other words, such a semi-metric fulfils our requirement in the sense that if a test query is distant (in distribution) from all training input data, it will achieve a low score. Though appealing, computing the full set of leave-one-out posteriors can be highly computationally expensive, of the order of $\mathcal{O}(n^{4})$: computing each inverse requires $\mathcal{O}(n^{3})$, and this has to be repeated $n$ times. A major source contributing to this expense, well known in the GP literature, is the need to invert covariance matrices. Rather than following variational approximations (which constitute an interesting direction), we prioritise sample efficiency and focus on exact GPs.

^2 As $\mathcal{D}_n$ changes at every outer iteration, we simply concatenate all data in one larger data set; see Algorithm 1. ^3 For clarity, we describe a one-dimensional scenario. We extend to multi-output GPs in our experiments. ^4 Though intuitive, we provide a formal treatment of the reason behind such growth properties in the appendix.
To this end, we exploit the inverse $(\boldsymbol{K}_{nn} + \sigma^{2}\boldsymbol{I})^{-1}$ already computed during the model-learning step and make use of the matrix inversion lemma petersen2008matrix to recursively update the mean and covariance of each leave-one-out posterior for all $i \in \{1, \dots, n\}$ (see appendix). Writing $\boldsymbol{B} = (\boldsymbol{K}_{nn} + \sigma^{2}\boldsymbol{I})^{-1}$ with columns $\boldsymbol{b}_{i}$, the inverse covariance with point $i$ removed is obtained from $\boldsymbol{B} - \boldsymbol{b}_{i}\boldsymbol{b}_{i}^{\top} / [\boldsymbol{B}]_{ii}$ after deleting the $i$-th row and column. Hence, updating the inverse covariance matrix only requires computing and subtracting the outer product of the $i$-th column of $\boldsymbol{B}$, divided by the corresponding diagonal element. This, in turn, reduces the complexity from $\mathcal{O}(n^{4})$ to $\mathcal{O}(n^{3})$.
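The rank-one downdate described above can be sketched in a few lines. This is our own illustration of the standard matrix-inversion-lemma identity (names are ours); given the precomputed inverse of the noisy kernel matrix, each leave-one-out inverse costs $\mathcal{O}(n^{2})$:

```python
import numpy as np

def loo_inverse(B, i):
    """Inverse of A with row/column i removed, given B = inv(A).

    Subtract the outer product of B's i-th column scaled by 1/B[i, i],
    then drop row and column i: O(n^2) instead of the O(n^3) cost of
    inverting the reduced matrix from scratch.
    """
    b = B[:, i]
    C = B - np.outer(b, b) / B[i, i]
    keep = [j for j in range(B.shape[0]) if j != i]
    return C[np.ix_(keep, keep)]

# Sanity check on a small symmetric positive-definite matrix.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
B = np.linalg.inv(A)
direct = np.linalg.inv(A[np.ix_([0, 2], [0, 2])])  # invert with index 1 removed
```

The identity follows from block-matrix inversion and holds for any index by symmetric permutation, which is what makes reusing the single full inverse across all $n$ leave-one-out models possible.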
$\mathcal{I}_{\text{B}}$: Bootstrapped symmetric metric

We also experimented with another metric that quantifies posterior sensitivity to bipartitions of the data, as measured by the symmetric KL-divergence^5 averaged over possible bipartitions: $\mathcal{I}_{\text{B}}(\boldsymbol{x}_{\star}) = \mathbb{E}_{(\mathcal{D}^{(1)}, \mathcal{D}^{(2)})}\left[ \text{KL}_{\text{sym}}\left( p\left(f(\boldsymbol{x}_{\star}) \mid \mathcal{D}^{(1)}\right), \, p\left(f(\boldsymbol{x}_{\star}) \mid \mathcal{D}^{(2)}\right) \right) \right]$, where $(\mathcal{D}^{(1)}, \mathcal{D}^{(2)})$ is a random bipartition of the data $\mathcal{D}_{n}$. In practice, we randomly split the data in half, and do this $m$ times (where $m$ is a tuneable hyperparameter) to get a collection of bipartitions, over which we then average. Similar to $\mathcal{I}_{\text{LOO}}$, $\mathcal{I}_{\text{B}}$ also assigns low importance to query points far from the training inputs, and hence can be useful for safe decision-making. In our experiments, $\mathcal{I}_{\text{B}}$ provided a better-behaved exploration strategy; see Section 4.

^5 The symmetric KL-divergence between two distributions $p$ and $q$ is defined as $\text{KL}_{\text{sym}}(p, q) = \text{KL}(p \| q) + \text{KL}(q \| p)$.
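Since the two bootstrap posteriors at a query point are Gaussian, the symmetric KL has a closed form. A minimal univariate sketch (names ours, matching the one-dimensional presentation used above):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def sym_kl(mu1, var1, mu2, var2):
    """Symmetric KL: KL(p || q) + KL(q || p)."""
    return kl_gauss(mu1, var1, mu2, var2) + kl_gauss(mu2, var2, mu1, var1)
```

Averaging `sym_kl` of the two half-data posteriors over $m$ random bipartitions gives the bootstrapped metric at a query point; queries far from the training inputs see both posteriors revert to the prior, driving the value towards zero.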
Transforming $\mathcal{I}(\boldsymbol{x}_{\star})$ to $\mathcal{I}(\tau)$

Both introduced functions are defined in terms of query test points $\boldsymbol{x}_{\star}$. To incorporate them in Equation 1, we define the trajectory-based expected total information gain as $\mathcal{I}(\tau) = \sum_{t} \gamma^{t} \mathcal{I}(s_t, a_t)$. Interestingly, this characterisation trades off long-term versus short-term information gain, similar to how cost trades off optimal greedy actions versus long-term decisions. In other words, it is not necessarily optimal to seek an action that maximises immediate information gain, since such a transition can ultimately drive the agent to unsafe states (i.e., ones that exhibit low $\mathcal{I}$ values). In fact, such horizon-based definitions have also recently been shown to improve modelling of dynamical systems buisson2019actively; shyam2018model. Of course, our problem is different in the sense that we seek safe policies in a safe decision-making framework, and thus require safely exploring (semi-)metrics.
3.3 Solution method
We now provide a solver to the problem in Equation 1. We operate using $\mathcal{I}_{\text{LOO}}$ and note that our derivations can be repeated exactly for $\mathcal{I}_{\text{B}}$. Since we maximise exploration, we use $-\mathcal{I}(\tau)$ in the minimisation problem in Equation 1. Effectively, we need to overcome two hurdles for an implementable algorithm. The first relates to the bi-objective nature of our problem, while the second concerns the CVaR constraint, which requires a Lagrangian-type solution.

From Bi- to Single-objectives: We transform the bi-objective problem into a single-objective one through a linear combination of the cost and exploration terms. This relaxed yet constrained version is given by $\min_{\theta} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\lambda_{\theta} \mathcal{C}(\tau) - (1 - \lambda_{\theta})\mathcal{I}(\tau)\right]$ subject to the CVaR constraint,^6 where $\lambda_{\theta} \in [0, 1]$ is a policy-dependent weighting. The choice of $\lambda_{\theta}$, however, can be difficult, as not any arbitrary combination is acceptable: it has to ultimately yield a Pareto-efficient solution.

Fortunately, the authors in sener2018multi have demonstrated that a theoretically grounded choice for such a weighting in a stochastic multi-objective problem is one that produces a common descent direction pointing opposite to the minimum-norm vector in the convex hull of the gradients, i.e., one solving $\min_{\lambda_{\theta} \in [0, 1]} \left\| \lambda_{\theta} \nabla_{\theta} \mathbb{E}\left[\mathcal{C}(\tau)\right] - (1 - \lambda_{\theta}) \nabla_{\theta} \mathbb{E}\left[\mathcal{I}(\tau)\right] \right\|_{2}^{2}$. Luckily, solving for $\lambda_{\theta}$ is straightforward and can be encoded using a rule-based strategy that compares gradient norms; see the appendix for further details.

^6 Please note that the negative sign in the linear combination is due to the fact that we use $-\mathcal{I}(\tau)$.
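For two objectives, the minimum-norm point in the convex hull of the gradients has a closed form (this is the two-task special case of the multiple-gradient descent algorithm of sener2018multi; the function names below are ours):

```python
import numpy as np

def min_norm_weight(g1, g2):
    """Weight lam in [0, 1] minimising ||lam * g1 + (1 - lam) * g2||^2.

    Closed-form solution of the two-gradient min-norm problem: project
    the unconstrained minimiser onto [0, 1].
    """
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:          # identical gradients: any weight is optimal
        return 0.5
    return float(np.clip(-(g2 @ diff) / denom, 0.0, 1.0))

def common_descent_direction(g1, g2):
    """Direction whose negative descends both objectives (when one exists)."""
    lam = min_norm_weight(g1, g2)
    return lam * g1 + (1.0 - lam) * g2
```

When the clip saturates at 0 or 1, one gradient dominates and the combination degenerates to the shorter gradient, which is exactly the rule-based comparison of gradient norms mentioned above.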
From Constrained to Unconstrained Objectives: We write an unconstrained problem using a Lagrange multiplier $\eta \geq 0$:

$$\min_{\theta} \max_{\eta \geq 0} \; \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\lambda_{\theta} \mathcal{C}(\tau) - (1 - \lambda_{\theta})\mathcal{I}(\tau)\right] + \eta \left( \text{CVaR}_{\alpha}\left(\mathcal{L}(\tau)\right) - \xi \right).$$

Due to the non-convexity of the problem, we cannot assume strong duality holds, so in our experiments we schedule $\eta$ proportionally to gradients, using a technique similar to that in schulman2017proximal that has proven effective.^7 To solve the above optimisation problem, we first fix $\eta$ and perform a policy-gradient step in $\theta$.^8 To minimise the variance of the gradient estimators of the cost and exploration terms, we build two neural-network critics that we use as baselines. The first attempts to model the value of the standard cost, while the second learns information-gain values. For the CVaR's gradient, we simply apply policy gradients. As CVaR is non-Markovian, it is difficult to estimate a separate critic for it. In our experiments, a heuristic using discounted safety losses as unbiased baselines proved effective. In short, our main update equation, when using a policy parameterised by a neural network with parameters $\theta$, can be written as:

$$\theta_{k+1} = \theta_{k} - \alpha_{\text{lr}} \left( \lambda_{\theta} \hat{\nabla}_{\theta} \mathbb{E}\left[\mathcal{C}(\tau)\right] - (1 - \lambda_{\theta}) \hat{\nabla}_{\theta} \mathbb{E}\left[\mathcal{I}(\tau)\right] + \eta \hat{\nabla}_{\theta} \text{CVaR}_{\alpha}\left(\mathcal{L}(\tau)\right) \right), \quad (2)$$

where $\alpha_{\text{lr}}$ is a learning rate, and $V_{\phi_{1}}$, $V_{\phi_{2}}$ are the neural-network critics with parameters $\phi_{1}$ and $\phi_{2}$. We present the main steps in Algorithm 1 and more details in the appendix.

^7 Note that a primal-dual method as in chow2015risk is not applicable due to non-convexity. In the future, we plan to study approaches from goh2001nonlinear to ease determining $\eta$. ^8 We resorted to policy gradients for two reasons: 1) cost functions are not necessarily differentiable, and 2) better experimental behaviour when compared to model backprop, especially on OpenAI's safety gym tasks.
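The combined update can be sketched as a single step over three estimated gradients. This is a schematic illustration of the update structure in Equation 2, with hypothetical argument names of our own choosing (in practice each gradient would come from the corresponding policy-gradient estimator with its critic baseline):

```python
import numpy as np

def samba_update(theta, grad_cost, grad_info, grad_cvar, lam, eta, lr=1e-2):
    """One policy multi-gradient step: weight cost vs. exploration by lam,
    then add the Lagrange-scaled CVaR penalty gradient."""
    step = lam * grad_cost - (1.0 - lam) * grad_info + eta * grad_cvar
    return theta - lr * step
```

With `eta = 0` and `lam = 0.5`, the step reduces to the pure cost/exploration trade-off; growing `eta` pushes the iterates towards constraint feasibility.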
4 Experiments
We assess SAMBA in terms of both safe learning (train) and safe final policies (test) on three dynamical systems, two of which are adaptations of standard dynamical systems for MBRL (Safe Pendulum and Safe Cart Pole Double Pendulum), while the third (Fetch Robot: optimally control an end-effector to reach a 3D goal) we adapt from OpenAI's robotics environments brockman2016openai. In each of these tasks, we define unsafe regions as areas in state space and design the safety loss (i.e., $l$) to correspond to the (linearly proportional) distance between the end-effector's position (when in the hazard region) and the centre of the unsafe region. SAMBA implemented a more stable proximal update of Equation 2, following a method similar to schulman2017proximal. We compare against algorithms from both the model-free and model-based literature. Model-free comparisons against TRPO schulman2015trust, PPO schulman2017proximal, CPO achiam2017constrained, STRPO (safety-constrained TRPO) raybenchmarking, and SPPO (safety-constrained PPO) raybenchmarking enable us to determine if SAMBA improves upon the following: sample complexities during training; total violations (TV), that is, the total number of time steps spent inside the unsafe region; and total accumulated safety cost (TC).^9 Comparison with model-based solvers (e.g., PlaNet hafner2018learning, (P)PILCO deisenroth2011pilco) sheds light on the importance of our active exploration metrics. It is important to note that when implementing PILCO, we preferred a flexible solution that does not assume moment matching and specific radial-basis-function controllers. Hence, we adapted PILCO to support proximal policy updates, referred to as PPILCO in our experiments, and similarly SPPILCO (safety-constrained PPILCO), which also proved effective. As SAMBA introduces exploration components to standard model-based learners, we analysed these independently before combining them and reporting TV and TC^9 (see Table 1). All policies are represented by two-hidden-layer (32 units each) neural networks with nonlinearities. Each requires under 12 hours of training on an NVIDIA GeForce RTX 2080 Ti, which has a power consumption of 225W, yielding an estimated training cost of £ per model, per seed. Due to space constraints, all hyperparameters needed to reproduce our results can be found in the appendix.

(Semi-)metrics component: Evaluating our (semi-)metrics, we conducted an analysis that reports a 2D-projection view of the state space at various intervals in the data-collection process. We compare $\mathcal{I}_{\text{LOO}}$ and $\mathcal{I}_{\text{B}}$ against an entropy-based exploration metric and report the results on two systems in Figure 1. It is clear that both our (semi-)metrics encourage agents to explore safe regions of the state space, as opposed to entropy, which mostly grows in unsafe regions. Similar results are demonstrated with the Fetch Reach robot task (in the appendix). It is also worth noting that due to the high-dimensional nature of the tasks, visual analysis can only give indications. Still, empirical analysis supports our claim and performance improvements are clear; see Table 1.

^9 Note, we report safe-learning-process TC, which is the total incurred safety cost throughout all training environment interactions, and safe-evaluation TC and TV, which similarly are the total incurred safety cost during evaluation and the total violations from the time steps spent inside the unsafe region during evaluation.
Learning and Evaluation: Having concluded that our (semi-)metrics indeed provide informative signals to the agent for safe exploration, we then conducted learning and evaluation experiments comparing SAMBA against state-of-the-art methods. Results reported in Table 1 demonstrate that SAMBA reduces the amount of training TC^9 and samples by orders of magnitude compared to others. Interestingly, during safe evaluation (deploying the learnt policy and evaluating performance), we see that SAMBA's safety performance is competitive with (if not significantly better than) policies trained for safety, in terms of TC and TV.^9 Such results are interesting, as SAMBA was never explicitly designed to minimise test TV,^9 but it was still able to acquire significant reductions. Of course, one might argue that these results do not convey a safe final policy, as violations are still non-zero. We remind, however, that we define safety in terms of CVaR constraints, which do not totally prohibit violations but rather limit the average cost of excess violations beyond a (user-defined) risk level $\alpha$. Indeed, as mentioned above, it is not possible to guarantee total safety without strong assumptions on dynamics and/or an initial safe policy (and, of course, none of the algorithms in Table 1 have zero violations).
Learner  Safe Pendulum  Safe Cart Pole Double Pendulum  Safe Fetch Reach  

Learning  Evaluation  Learning  Evaluation  Learning  Evaluation  
Samples  TC  TV  TC  Samples  TC  TV  TC  Samples  TC  TV  TC  
TRPO  1000  110  4.3  4.9  1000  310  10  23  1000  790  1.8  2.7 
PPO  1000  81  3.6  5  1000  140  5.5  35  1000  670  1.6  3.1 
PlaNet  100  56  4.5  5.2  40  2.6  7.7  29  100  110  2.3  2.9 
PPILCO  2  0.04  2.8  4.5  6  0.097  7.1  33  5  0.49  1.9  3 
PlaNet w RS  100  51  3.5  4.2  40  1.3  1.6  2.5         
CPO  1000  58  4.7  4.4  1000  95  1.1  5  1000  160  1.9  2.7 
STRPO  1000  38  2.0  2.1  1000  83  2.1  2.7  1000  170  1.7  2.1 
SPPO  1000  65  1.7  2.4  1000  68  1  1.8  1000  290  1.5  2 
SPPILCO  1.8  0.02  1.7  2.1  6  0.062  1.2  1.6  5  0.37  1.4  2.2 
SAMBA  1.6  0.01  1.5  2.0  5  0.054  0.85  1.4  5  0.23  1.2  1.9 

The data is scaled by . Note: PlaNet w RS on Safe Fetch Reach diverged during training.
To evaluate risk performance, we conducted an in-depth study evaluating all methods on the two most challenging environments, Safe Cart Pole Double Pendulum and Safe Fetch Reach, using cost limits and . Table 2 demonstrates that SAMBA achieves lower safety-cost quartiles and lower expected safety cost. Therefore, we conclude that, indeed, SAMBA produces safe final policies in terms of its objective in Equation 1.

Learner  Safe Cart Pole Double Pendulum  Safe Fetch Reach  
Quartile  Constraint  Quartile  Constraint  
Exp.  CVaR  Exp.  CVaR  
PPO  5.0  5.7  9.4  19  0.95  1.25  1.88  0.26  
PPILCO  5.0  5.8  8.6  21  0.91  1.22  1.45  0.36  
PlaNet  5.2  6.0  8.8  21  0.64  1.09  1.88  0.39  
TRPO  5.1  5.9  7.0  22  0.69  0.79  0.92  0.58  
CPO  0.41  0.47  0.96  23  0.81  0.92  1.02  0.43  
PlaNet w RS  0.83  0.94  1.11  24              
SPPILCO  0.21  0.28  0.33  25  0.45  0.77  1.13  0.57  
STRPO  0.22  0.27  0.32  28  0.14  0.46  0.86  0.98  
SPPO  0.19  0.24  0.29  27  0.27  0.44  0.71  1.51  
SAMBA  0.15  0.21  0.24  27  0.00  0.05  0.19  2.27 

Note: PlaNet w RS on Safe Fetch Reach diverged during training.
5 Conclusion and future work
We proposed SAMBA, a safe and active model-based learner that makes use of GP models and solves a bi-objective constrained problem to trade off cost minimisation, exploration, and safety. We evaluated our method on three benchmarks, including ones from OpenAI's safety gym, and demonstrated significant reductions in training cost and sample complexity, as well as safe final policies in terms of CVaR constraints, compared to the state of the art.
In future work, we plan to generalise to variational GPs damianou2011variational and to apply our method in real-world robotics. Additionally, we want to study the theoretical guarantees of SAMBA to demonstrate convergence to well-behaving, exploratory, and safe policies.
Broader Impact
Not applicable.
References
Appendix A Algorithm
In this section we provide a more detailed description of Algorithm 1. We shall use the following standard shorthand for the advantage function: $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$.