Reinforcement learning (RL) has seen successes in domains such as video and board games mnih2013playing; silver2016mastering; silver2017mastering, and control of simulated robots ammar2014online; schulman2015trust; schulman2017proximal. Though successful, these applications assume idealised simulators and require tens of millions of agent-environment interactions typically performed by randomly exploring policies. In real-world safety-critical applications, however, such an idealised framework of random exploration with the ability to gather samples at ease falls short, partly due to the catastrophic costs of failure and the high operating costs. Hence, if RL algorithms are to be applied in the real world, safe agents that are sample-efficient and capable of mitigating risk need to be developed. To this end, different works adopt varying safety definitions, where some are interested in safe learning, i.e., safety during the learning process, while others focus on acquiring safe policies eventually. Safe learning generally requires some form of safe initial policy as well as regularity assumptions on dynamics, all of which depend on which notion of safety is considered. For instance, safety defined by constraining trajectories to safe regions of the state-action space is studied in, e.g., akametalu2014reachability, which assumes partially-known control-affine dynamics with Lipschitz regularity conditions, as well as in koller2018learning; aswani2013provably, both of which require strong assumptions on dynamics and initial control policies to give theoretical guarantees of safety. Safety in terms of Lyapunov stability khalil2002nonlinear is studied in, e.g., chow2018lyapunov; chow2019lyapunov (model-free), which require a safe initial policy, and berkenkamp2017safe (model-based), which requires Lipschitz dynamics and a safe initial policy. 
The work in wachi2018safe (which builds on turchetta2016safe) considers deterministic dynamics and attempts to optimise expected return while not violating a pre-specified safety threshold. Several papers attempt to keep expectation constraints satisfied during learning, e.g., achiam2017constrained extends ideas of kakade2002approximately, while dalal2018safe adds a safety layer which makes action corrections to maintain safety. When it comes to safe final policies (defined in terms of corresponding constraints), on the other hand, some works chow2014algorithms; prashanth2014policy consider risk-based constraints and develop model-free solvers.
Unfortunately, most of these methods are sample-inefficient and make a large number of visits to unsafe regions. Given our interest in algorithms achieving safe final policies while reducing the number of visits to unsafe regions, we pursue a safe model-based framework that assumes neither safe initial policies nor a priori knowledge of the transition model. As we believe that sample efficiency is key for effective safe solvers, we choose Gaussian processes GPbook for our model. Of course, any such framework is hampered by the quality and assumptions of the model’s hypothesis space. Aiming at alleviating these restrictions, we go further and integrate active learning (discussed in Section 2.2), wherein the agent influences where to query/sample from in order to generate new data so as to reduce model uncertainty, and so ultimately to learn more efficiently. Successful application of this approach is very much predicated upon the chosen method of quantifying potential uncertainty reduction, i.e., what we refer to as the (semi-)active metric. Common (semi-)active metrics are those which identify points in the model with large entropy or large variance, or where those quantities would be most reduced in a posterior model if the given point were added to the data set krause2007nonmyopic; krause2008near; fedorov2013theory; settles2009active. However, our desire for safety adds a complication, since a safe learning algorithm will likely have greater uncertainty in regions where it is unsafe by virtue of not exploring those regions deisenroth2011pilco; kamthe2017data. Indeed, our experiments (Figure 1) support this claim.
Attacking the above challenges, we propose two novel out-of-sample (semi-)metrics for Gaussian processes that allow for exploration in novel areas while remaining close to training data, thus avoiding unsafe regions. To enable effective and grounded introduction of active exploration and safety constraints, we define a novel constrained bi-objective formulation of RL and provide a policy multi-gradient solver that is proven effective on a variety of safety benchmarks. In short, our contributions can be stated as follows: 1) novel constrained bi-objective formulation enabling exploration and safety consideration, 2) safety-aware active (semi-)metrics for exploration, and 3) policy multi-gradient solver trading off cost minimisation, exploration maximisation, and constraint feasibility. We test our algorithm on three stochastic dynamical systems after augmenting these with safety regions and demonstrate a significant reduction in sample and cost complexities compared to the state-of-the-art.
2 Background and notation
2.1 Reinforcement learning
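We consider Markov decision processes (MDPs) with continuous state and action spaces, $\mathcal{M} = \langle \mathcal{X}, \mathcal{U}, \mathcal{P}, c, \gamma \rangle$, where $\mathcal{X}$ denotes the state space, $\mathcal{U}$ the action space, $\mathcal{P}: \mathcal{X} \times \mathcal{U} \times \mathcal{X} \to [0, \infty)$ is a transition density function, $c: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ is the cost function and $\gamma \in [0, 1)$ is a discount factor. At each time step $t$, the agent is in state $x_t \in \mathcal{X}$ and chooses an action $u_t \in \mathcal{U}$, transitioning it to a successor state $x_{t+1} \sim \mathcal{P}(\cdot \mid x_t, u_t)$, and yielding a cost $c(x_t, u_t)$. Given a state $x_t$, an action $u_t$ is sampled from a policy $\pi$, where we write $\pi(u_t \mid x_t)$ to represent the conditional density of an action. Upon subsequent interactions, the agent collects a trajectory $\tau = (x_0, u_0, \dots, x_{T-1}, u_{T-1}, x_T)$, and aims to determine an optimal policy $\pi^\star$ by minimising total expected cost: $\pi^\star \in \arg\min_{\pi} \mathbb{E}_{\tau \sim p_\pi(\tau)}[\mathcal{C}(\tau)]$ with $\mathcal{C}(\tau) = \sum_{t=0}^{T-1} \gamma^t c(x_t, u_t)$, where $p_\pi(\tau)$ denotes the trajectory density defined as: $p_\pi(\tau) = \mu_0(x_0) \prod_{t=0}^{T-1} \mathcal{P}(x_{t+1} \mid x_t, u_t)\, \pi(u_t \mid x_t)$, with $\mu_0(\cdot)$ being an initial state distribution.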
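As a minimal sketch of the objective above, assuming a finite-horizon trajectory stored as parallel lists of states and actions, the discounted cost of one trajectory and its Monte Carlo average over sampled trajectories could be computed as follows. The names `discounted_cost`, `expected_cost` and `cost_fn` are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def discounted_cost(
    states: Sequence[State],
    actions: Sequence[Action],
    cost_fn: Callable[[State, Action], float],
    gamma: float,
) -> float:
    """Return sum_t gamma^t * cost_fn(state_t, action_t) for one trajectory."""
    if len(states) != len(actions):
        raise ValueError("expected one action per state")
    total = 0.0
    for t, (state, action) in enumerate(zip(states, actions)):
        total += (gamma ** t) * cost_fn(state, action)
    return total


def expected_cost(trajectories, cost_fn, gamma: float) -> float:
    """Monte Carlo estimate of the expected discounted cost.

    `trajectories` is a non-empty list of (states, actions) pairs sampled
    by running the policy; the estimate is the plain sample average.
    """
    costs = [discounted_cost(s, a, cost_fn, gamma) for s, a in trajectories]
    return sum(costs) / len(costs)
```

The estimate is only as good as the number of sampled trajectories, which is the sample-efficiency concern that motivates the model-based approach in the rest of this section.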
Constrained MDPs: The above can be generalised to include various forms of constraints, often motivated by the desire to impose some form of safety measures. Examples are expectation constraints achiam2017constrained; altman1999constrained (which have the same form as the objective, i.e., expected discounted sum of costs), constraints on the variance of the return prashanth2013actor, chance constraints (a.k.a. Value-at-Risk (VaR)) chow2017risk, and Conditional Value-at-Risk (CVaR) chow2014algorithms; chow2017risk; prashanth2014policy. The latter is the constraint we adopt in this paper for reasons that will be elucidated below. Adding constraints means we cannot directly apply standard algorithms like policy gradient sutton2018reinforcement, and different techniques are required, e.g., via Lagrange multipliers bertsekas1997nonlinear, as was done in chow2014algorithms; chow2017risk; prashanth2014policy, among many others. Further, current methods only consider cost minimisation, with no regard to exploration, which we address in this paper.
Model-Based Reinforcement Learning: Current solutions to the problem described above (constrained or unconstrained) can be split into model-free and model-based methods. Though effective, model-free algorithms are highly sample inefficient hessel2018rainbow. For sample-efficient solvers, we follow model-based strategies that we now detail. To reduce the number of interactions with the real environment, model-based solvers build surrogate models, $\mathcal{P}_{\text{surr}}$, to determine optimal policies. These methods typically run two main loops. The first gathers traces from the real environment to update $\mathcal{P}_{\text{surr}}$, while the second improves the policy using $\mathcal{P}_{\text{surr}}$ deisenroth2011pilco; hafner2019dream. Among various candidate models, e.g., world models ha2018world, in this paper we follow PILCO deisenroth2011pilco and adopt Gaussian processes (GPs), as we believe that uncertainty quantification and sample efficiency are key for real-world considerations of safety. In this construction, one places a Gaussian process prior on a latent function $f$ to map between input-output pairs. Such a prior is fully specified by a mean, $m(\cdot)$, and a covariance function $k(\cdot, \cdot)$ GPbook. We write $f \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$ to emphasize that $f$ is sampled from a GP GPbook. Given a data-set of input-output pairs $\mathcal{D} = \{(\tilde{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, corresponding, respectively, to state-action and successor state tuples, one can perform predictions on a query set of $n_\star$ test data points $\tilde{X}_\star$. Such a distribution is Gaussian with predictive mean-vectors and covariance matrices given by: $\mu_\star = K_\star^\top A^{-1} y$ and $\Sigma_\star = K_{\star\star} - K_\star^\top A^{-1} K_\star$, where $A = K + \sigma_\omega^2 I$ with $\sigma_\omega^2$ being the variance of the noise, which is assumed to be Gaussian. In the above, we also defined $y$ as a vector concatenating all training labels, $K = k(\tilde{X}, \tilde{X})$, $K_\star = k(\tilde{X}, \tilde{X}_\star)$, and $K_{\star\star} = k(\tilde{X}_\star, \tilde{X}_\star)$, where $\tilde{X}$ and $\tilde{X}_\star$ are feature matrices with $n \times d$ and $n_\star \times d$ sizes respectively. We executed training in GPyTorch gardner2018gpytorch, and used multi-output GPs as defined in wolff2020mogptk.
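The exact GP prediction described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the GPyTorch implementation used in the experiments: it assumes a zero prior mean, a single output, and a squared-exponential kernel with fixed hyper-parameters. The names `rbf` and `gp_posterior` are hypothetical.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / lengthscale ** 2)


def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Zero-mean exact GP posterior mean vector and covariance matrix at x_test."""
    k_cross = rbf(x_train, x_test)  # train-test covariance, shape (n, n_test)
    k_test = rbf(x_test, x_test)    # test-test covariance
    a = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    chol = np.linalg.cholesky(a)    # a = chol @ chol.T
    # Apply the inverse of `a` through its Cholesky factor instead of inverting it.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_cross)
    mean = k_cross.T @ alpha
    cov = k_test - v.T @ v
    return mean, cov
```

Evaluating `gp_posterior` at a query far from all training inputs returns a variance close to the prior variance, whereas a query next to a training input returns a variance close to the noise level. This is the behaviour that makes purely uncertainty-driven exploration favour unvisited, and hence potentially unsafe, regions.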
2.2 Active learning in dynamical systems
In active learning fedorov2013theory; settles2009active, an agent chooses points to sample/query that best improve learning or model updates. This is often performed by optimising an acquisition function, which gives some quantification of how much a model would improve if a given data point were queried, e.g., points where the model has high entropy or where variance can be most reduced. Active learning with GPs has been studied in the static case, where points can be selected at will (see, e.g., krause2007nonmyopic; krause2008near). In the context of dynamical systems, however, added complications arise as one is not always able to directly drive the system into a desired state. Recent work has attempted to resolve this problem, e.g., in buisson2019actively and schultheis2019receding, receding horizon optimisation is used to iteratively update a model, and in buisson2019actively, actions are favoured that maximise the sum of differential entropy terms at each point in the mean trajectory predicted to occur by those actions. Moreover, in schultheis2019receding, a sum of variance terms is optimised to improve Bayesian linear regression. Again, for computational tractability, the predicted mean of states is used, as propagating state distributions in the model is difficult. Different to our paper, neither of these works deals with safety, nor do they have additional objectives to maximise/minimise, thereby avoiding a bi-objective formulation. In jain2018learning, a GP model that is used for MPC is updated by greedily selecting points which maximise information gain, i.e., reduction in entropy, as is done in krause2008near. Only very recently, the authors in ball2020ready proposed an active learning approach coupled with MBRL. Similar to SAMBA, they use an adaptive convex combination of objectives; however, their exploration metric is based on reward variance computed from a (finite) collection of models, increasing the burden on practitioners who now need to predefine the collection of dynamics. They do not use GPs as we do, and do not consider safety. Compared to ball2020ready, we believe SAMBA is more flexible, supporting model-learning from scratch and enabling principled exploration coupled with safety consideration. Further afield from our work, active learning has been recently studied in the context of GP time-series in zimmer2018safe, and for pure exploration in shyam2018model, which uses a finite collection of models. Our (semi-)metrics generalise the above to consider safe-regions and future information trade-off as we detail in Section 3.2.
3 SAMBA: Framework & solution
In designing SAMBA, we take PILCO deisenroth2011pilco as a template and introduce two novel ingredients allowing for active exploration and safety. Following PILCO, SAMBA runs a main loop that gathers traces from the real environment and updates a surrogate model, $\mathcal{P}_{\text{surr}}$, encoded by a Gaussian process. Given $\mathcal{P}_{\text{surr}}$, PILCO and other model-based methods srinivas2020curl attempt to obtain a policy that minimises total expected cost with respect to traces, $\tau$, sampled from the learnt model by solving $\min_{\pi} \mathbb{E}_{\tau \sim p^{\pi}_{\text{surr}}(\tau)}[\mathcal{C}(\tau)]$ with $\mathcal{C}(\tau)$ the total discounted cost along $\tau$ and $p^{\pi}_{\text{surr}}$ the trajectory density induced by the policy under the surrogate model. The updated policy is then used to sample new traces from the real system where the above process repeats. During this sampling process, model-based algorithms consider various metrics in acquiring transitions that reveal novel information, which can be used to improve the surrogate model’s performance. PILCO, for instance, makes use of the GP uncertainty, while ensemble models saphal2020seerl; van2020simple explore by their aggregated uncertainties. With sufficient exploration, this allows policies obtained from surrogate models to control real systems. Our safety considerations mean we would prefer agents that learn well-behaving policies with minimal sampling from unsafe regions of state-action spaces; a property we achieve later by incorporating CVaR constraints as we detail in Section 3.1. Requiring a reduced number of visits to unsafe regions lessens, by definition, the amount of “unsafe” data gathered in such areas. Therefore, model entropy is naturally increased in these territories and algorithms following entropy-driven exploration strategies are, as a result, encouraged to sample from hazardous states. As such, a naive adaptation of entropy-based exploration can quickly become problematic by contradicting safety requirements. To circumvent these problems, we introduce two new active exploration (semi-)metrics in Section 3.2 that assess information beyond training-data availability and consider input-output data distributions. Our (semi-)metrics operate under the assumption that during any model update step, “safe” transition data (i.e., a set of state-action-successor states sampled from safe regions) is more abundant in number when compared to “unsafe” triplets. Considering such a skew between distributions, our (semi-)metrics yield increased values on test queries close to high-density training data. Given such (semi-)metrics, we enable novel model-based algorithms that solve a bi-objective optimisation problem that attempts to minimise cost, while maximising active values. In other words, during this step, we not only update policies to be well-behaving in terms of the total cost but also to actively explore safe transitions that allow for improved models in successor iterations.
Of course, the assumption of having skew towards safe regions in training data distribution is generally not true since solving the above only ensures good expected returns. To frequently sample safe regions, we augment cost minimisation with a safety constraint that is encoded through the CVaR of a user-defined safety cost function with respect to model traces. Hence, SAMBA solves a bi-objective constrained optimisation problem (§3.1) aimed at minimising cost, maximising active exploration, and meeting safety constraints.
3.1 Bi-objective constrained MDPs
Given a (GP) model of an environment, we formalise our problem as a generalisation of constrained MDPs to support bi-objective losses. We define a bi-objective MDP by a tuple $\langle \mathcal{X}, \mathcal{U}, \mathcal{P}_{\text{GP}}, c, l, \zeta, \gamma \rangle$ consisting of state space $\mathcal{X}$, action space $\mathcal{U}$, Gaussian process transition model $\mathcal{P}_{\text{GP}}$, cost function $c$, constraint cost function (used to encode safety) $l$, additional objective function $\zeta: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, and discount factor $\gamma$. (Footnote 1: It is worth noting that the (semi-)metrics we develop in Section 3.2 map to $\mathbb{R}_{+}$ instead of $\mathbb{R}$.) In our setting, for instance, $l$ encodes the state-action’s risk by measuring the distance to an unsafe region, while $\zeta$ denotes an active (semi-)metric from those described in Section 3.2. To finalise the problem definition of bi-objective MDPs, we need to consider an approachable constraint to describe safety considerations. In incorporating such constraints, we are chiefly interested in those that are flexible (i.e., can support different user-designed safety criteria) and allow us to quantify events occurring in tails of cost distributions. When surveying single-objective constrained MDPs, we realise that the literature predominantly focuses on expectation-type constraints achiam2017constrained; raybenchmarking, a not so flexible approach restricted to being safe on average. Others, however, make use of conditional value-at-risk (CVaR), a coherent risk measure chow2017risk that provides powerful and flexible notions of safety (i.e., can support expectation, tail-distribution, or hard, unsafe state visitation, constraints) and quantifies tail risk in the worst $(1-\alpha)$-quantile. For a random variable $Z$, it is defined as: $\text{CVaR}_{\alpha}(Z) = \min_{\nu \in \mathbb{R}} \mathbb{E}\left[\nu + \frac{1}{1-\alpha}(Z - \nu)^{+}\right]$, where $(Z - \nu)^{+} = \max(Z - \nu, 0)$. With such a constraint, we can write the optimisation problem of our bi-objective MDP as:
$$\min_{\pi} \; \mathbb{E}_{\tau \sim p^{\pi}_{\text{GP}}(\tau)}\big[\mathcal{C}(\tau), \zeta(\tau)\big]^{\top} \quad \text{s.t.} \quad \text{CVaR}_{\alpha}(\mathcal{L}(\tau)) \leq \xi, \qquad (1)$$
with $\mathcal{L}(\tau)$ being total accumulated safety cost along $\tau$, and $\xi$ a safety threshold. Of course, Equation 1 describes a problem not standard to reinforcement learning. In Section 3.3, we devise policy multi-gradient updates to determine $\pi$.
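To make the CVaR constraint concrete, the sketch below estimates CVaR from a finite sample of accumulated safety costs using the Rockafellar-Uryasev minimisation form, i.e. the minimum over a threshold of that threshold plus the scaled expected excess above it. For an empirical distribution the objective is convex and piecewise linear with breakpoints at the samples, so evaluating it at every sample value is exact. The names `empirical_cvar` and `is_feasible`, and the parameter `threshold`, are hypothetical.

```python
from typing import Sequence


def empirical_cvar(costs: Sequence[float], alpha: float) -> float:
    """Empirical CVaR at level alpha of a non-empty sample of costs.

    Minimises nu + mean(max(z - nu, 0)) / (1 - alpha) over nu, restricting
    nu to the observed sample values (where the minimum is attained).
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(costs)

    def objective(nu: float) -> float:
        excess = sum(max(z - nu, 0.0) for z in costs) / n
        return nu + excess / (1.0 - alpha)

    return min(objective(nu) for nu in costs)


def is_feasible(costs: Sequence[float], alpha: float, threshold: float) -> bool:
    """Check the sampled version of the constraint CVaR_alpha(costs) <= threshold."""
    return empirical_cvar(costs, alpha) <= threshold
```

With `alpha = 0` the estimate reduces to the sample mean, which recovers an expectation constraint, while values of `alpha` close to one concentrate on the worst outcomes. This is the flexibility referred to in the text.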
3.2 $\zeta$-functions for safe active exploration
In general, the additional objective $\zeta$ can be any bounded objective that needs to be maximised/minimised in addition to standard cost. Here, we choose one that enables active exploration in safe state-action regions. To construct $\zeta$, we note that a feasible policy of the problem in Equation 1, i.e., one abiding by CVaR constraints, samples tuples that mostly reside in safe regions. As such, the training data distribution is skewed in the sense that safe state-action pairs are more abundant than unsafe ones. Exploiting such skewness, we can indirectly encourage agents to sample safe transitions by maximising information gain (semi-)metrics that only grow in areas close enough to training data.
$\zeta_{\text{LOO}}$: Leave-One-Out semi-metric
Consider a GP dynamics model, $\mathcal{P}_{\text{GP}}$, that is trained on a state-action-successor-state data set $\mathcal{D} = \{\langle \tilde{x}^{(i)}, y^{(i)} \rangle\}_{i=1}^{n}$ with $\tilde{x}^{(i)} = (x^{(i)}, u^{(i)})$. (Footnote 2: As $\mathcal{D}$ changes at every outer iteration, we simply concatenate all data in one larger data set; see Algorithm 1.) Such a GP induces a posterior allowing us to query predictions on test points $\tilde{X}_\star$. As noted in Section 2.1, the posterior is also Gaussian with the following mean vector and covariance matrix: $\mu_\star = K_\star^\top A^{-1} y$ and $\Sigma_\star = K_{\star\star} - K_\star^\top A^{-1} K_\star$, where $A$ is the noisy training covariance and $K_\star$, $K_{\star\star}$ are the train-test and test-test covariances. (Footnote 3: For clarity, we describe a one-dimensional scenario. We extend to multi-output GPs in our experiments.) Our goal is to design a measure that increases in regions with dense training data (due to the usage of the CVaR constraint) to aid agents in exploring novel yet safe tuples. To that end, we propose using an expected leave-one-out semi-metric between two Gaussian processes defined, for a single query data point $\tilde{x}_\star$, as: $\zeta_{\text{LOO}}(\tilde{x}_\star) = \mathbb{E}_{i \sim \text{Uniform}[1, n]}\left[\text{KL}\left(p(f_\star \mid \mathcal{D}_{\neg i}) \,\|\, p(f_\star \mid \mathcal{D})\right)\right]$ with $\mathcal{D}_{\neg i}$ being $\mathcal{D}$ with point $i$ left out. Importantly, such a measure will only grow in regions which are close enough to sampled training data, as posterior mean and covariance of $p(f_\star \mid \mathcal{D}_{\neg i})$ shift by a factor that scales linearly and quadratically, respectively, with the total covariance between $\tilde{x}_\star$ and $\tilde{X}_{\neg i}$, where $\tilde{X}_{\neg i}$ denotes a feature matrix with the $i$-th row removed. (Footnote 4: Though intuitive, we provide a formal treatment of the reason behind such growth properties in the appendix.) In other words, such a semi-metric fulfils our requirement in the sense that if a test query is distant (in distribution) from all training input data, it will achieve a low score. Though appealing, computing the full set of leave-one-out posteriors can be highly computationally expensive, of the order of $\mathcal{O}(n^4)$: computing $A_{\neg i}^{-1}$ requires $\mathcal{O}(n^3)$ and this has to be repeated $n$ times. A major source contributing to this expense, well-known in GP literature, is related to the need to invert covariance matrices. Rather than following variational approximations (which constitute an interesting direction), we prioritise sample-efficiency and focus on exact GPs.
To this end, we exploit the $A^{-1}$ already computed during the model-learning step and make use of the matrix inversion lemma petersen2008matrix to recursively update the mean and covariances of $p(f_\star \mid \mathcal{D}_{\neg i})$ for all $i$ (see appendix): $A_{\neg i}^{-1} = \left[A^{-1} - a_i a_i^\top / a_{ii}\right]_{\neg i}$, with $a_i$ being the $i$-th row of $A^{-1}$, $a_{ii}$ its $i$-th entry, and $[\cdot]_{\neg i}$ dropping the $i$-th row and column. Hence, updating the inverse covariance matrix only requires computing and subtracting the outer product of the row, divided by the diagonal element. This, in turn, reduces complexity from $\mathcal{O}(n^4)$ to $\mathcal{O}(n^3)$.
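The saving from the matrix inversion lemma can be illustrated with a short NumPy sketch. It assumes a single-output, zero-mean GP with a unit-variance squared-exponential kernel, and it assumes that the semi-metric averages the KL divergence from each leave-one-out posterior to the full posterior at one query point; the direction of the KL divergence is an assumption of this sketch. The names `rbf`, `loo_inverse`, `gaussian_kl` and `loo_semi_metric` are hypothetical.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / lengthscale ** 2)


def loo_inverse(a_inv, i):
    """Inverse of symmetric A with row and column i removed, from A^{-1} in O(n^2)."""
    keep = np.arange(a_inv.shape[0]) != i
    col = a_inv[keep, i]
    return a_inv[np.ix_(keep, keep)] - np.outer(col, col) / a_inv[i, i]


def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(N(mean_p, var_p) || N(mean_q, var_q)) for scalar Gaussians."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)


def loo_semi_metric(x_train, y_train, x_query, noise_var=1e-2):
    """Average KL between each leave-one-out posterior and the full posterior."""
    n = len(y_train)
    a_inv = np.linalg.inv(rbf(x_train, x_train) + noise_var * np.eye(n))
    k_q = rbf(x_train, x_query[None, :])[:, 0]
    mean_full = k_q @ a_inv @ y_train
    var_full = 1.0 - k_q @ a_inv @ k_q  # prior variance of this kernel is 1
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        a_inv_i = loo_inverse(a_inv, i)  # rank-one downdate, no new inversion
        mean_i = k_q[keep] @ a_inv_i @ y_train[keep]
        var_i = 1.0 - k_q[keep] @ a_inv_i @ k_q[keep]
        total += gaussian_kl(mean_i, var_i, mean_full, var_full)
    return total / n
```

For a query far from every training input, all posteriors collapse to the prior and the score is numerically zero, whereas a query near the data yields a positive score. This mirrors the growth property described above.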
$\zeta_{\text{Bootstrap}}$: Bootstrapped symmetric metric
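We also experimented with another metric that quantifies posterior sensitivity to bi-partitions of the data as measured by symmetric KL-divergence, averaged over possible bi-partitions: $\zeta_{\text{Bootstrap}}(\tilde{x}_\star) = \mathbb{E}_{\langle \mathcal{D}_1, \mathcal{D}_2 \rangle}\left[\text{KL}_{\text{sym}}\left(p(f_\star \mid \mathcal{D}_1) \,\|\, p(f_\star \mid \mathcal{D}_2)\right)\right]$, where $\langle \mathcal{D}_1, \mathcal{D}_2 \rangle$ is a random bi-partition of the data $\mathcal{D}$. (Footnote 5: A symmetric KL-divergence between two distributions $p$ and $q$ is defined as: $\text{KL}_{\text{sym}}(p \,\|\, q) = \text{KL}(p \,\|\, q) + \text{KL}(q \,\|\, p)$.) In practice, we randomly split the data in half, and repeat this a fixed number of times (a tuneable hyper-parameter) to get a collection of bi-partitions. We then average over that collection. Similar to $\zeta_{\text{LOO}}$, $\zeta_{\text{Bootstrap}}$ also assigns low importance to query points far from the training inputs, and hence, can be useful for safe decision-making. In our experiments, provided better-behaving exploration strategy, see Section 4.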
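The bootstrapped metric can be sketched in the same spirit. The code below assumes a single-output, zero-mean GP with a unit-variance squared-exponential kernel, and it takes the symmetric KL divergence to be the sum of the two directed divergences (some authors halve this sum). The names `rbf`, `posterior_1d`, `symmetric_kl` and `bootstrap_metric`, and the parameters `num_splits` and `seed`, are hypothetical.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / lengthscale ** 2)


def posterior_1d(x_train, y_train, x_query, noise_var):
    """Posterior mean and variance of a zero-mean GP at a single query point."""
    a = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_q = rbf(x_train, x_query[None, :])[:, 0]
    sol = np.linalg.solve(a, k_q)
    return sol @ y_train, 1.0 - sol @ k_q  # prior variance of this kernel is 1


def symmetric_kl(m1, v1, m2, v2):
    """KL(p || q) + KL(q || p) for scalar Gaussians p = N(m1, v1), q = N(m2, v2)."""
    d2 = (m1 - m2) ** 2
    return 0.5 * ((v1 + d2) / v2 + (v2 + d2) / v1 - 2.0)


def bootstrap_metric(x_train, y_train, x_query, num_splits=10, noise_var=1e-2, seed=0):
    """Average symmetric KL between posteriors fitted on random halves of the data."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    total = 0.0
    for _ in range(num_splits):
        perm = rng.permutation(n)
        first, second = perm[: n // 2], perm[n // 2:]
        m1, v1 = posterior_1d(x_train[first], y_train[first], x_query, noise_var)
        m2, v2 = posterior_1d(x_train[second], y_train[second], x_query, noise_var)
        total += symmetric_kl(m1, v1, m2, v2)
    return total / num_splits
```

Each split needs two smaller matrix solves, so the cost grows with `num_splits` rather than with the number of data points being left out.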
Both introduced functions are defined in terms of query test points . To incorporate in Equation 1, we define trajectory-based expected total information gain as . Interestingly, this characterisation trades off long-term versus short-term information gain similar to how cost trades-off optimal greedy actions versus long-term decisions. In other words, it is not necessarily optimal to seek an action that maximises immediate information gain since such a transition can ultimately drive the agent to unsafe states (i.e., ones that exhibit low values). In fact, such horizon-based definitions have also recently been shown to improve modelling of dynamical systems buisson2019actively; shyam2018model. Of course, our problem is different in the sense that we seek safe policies in a safe decision-making framework, and thus require safely exploring (semi-)metrics.
3.3 Solution method
We now provide a solver to the problem in Equation 1. We operate using and note that our derivations can exactly be repeated for . Since we maximise exploration, we use in the minimisation problem in Equation 1. Effectively, we need to overcome two hurdles for an implementable algorithm. The first, relates to the bi-objective nature of our problem, while the second is concerned with the CVaR constraint that requires a Lagrangian-type solution.
From Bi- to Single objectives: We transform the bi-objective problem into a single objective one through a linear combination of and . This relaxed yet constrained version is given by , where is a policy dependent weighting. (Footnote 6: Please note that the negative sign in the linear combination is due to the fact that we used .) The choice of , however, can be difficult: not every combination is acceptable, since it has to ultimately yield a Pareto-efficient solution. Fortunately, the authors in sener2018multi have demonstrated that a theoretically-grounded choice for such a weighting in a stochastic multi-objective problem is one that produces a common descent direction that points opposite to the minimum-norm vector in the convex hull of the gradients, i.e., one solving: . Luckily, solving for is straightforward and can be encoded using a rule-based strategy that compares gradient norms; see appendix for further details.
From Constrained to Unconstrained Objectives: We write an unconstrained problem using a Lagrange multiplier : . Due to non-convexity of the problem, we cannot assume strong duality holds, so in our experiments, we schedule proportional to gradients using a technique similar to that in schulman2017proximal that has proven effective. (Footnote 7: Note that a primal-dual method as in chow2015risk is not applicable due to non-convexity. In the future, we plan to study approaches from goh2001nonlinear to ease determining .) To solve the above optimisation problem, we first fix and perform a policy gradient step in . (Footnote 8: We resorted to policy gradients for two reasons: 1) cost functions are not necessarily differentiable, and 2) better experimental behaviour when compared to model back-prop, especially on OpenAI’s safety gym tasks.)
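For two objectives, the minimum-norm point in the convex hull of the gradients has a well-known closed form, which reduces to a rule that compares inner products and norms of the two gradients. The sketch below is one such rule in the spirit of sener2018multi and is not necessarily the exact rule in the appendix. Here `g1` and `g2` stand for the flattened gradients of the cost objective and of the negated exploration objective; these names, together with `min_norm_weight` and `common_direction`, are hypothetical.

```python
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def min_norm_weight(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Weight w in [0, 1] minimising the squared norm of w * g1 + (1 - w) * g2."""
    dot11, dot12, dot22 = _dot(g1, g1), _dot(g1, g2), _dot(g2, g2)
    if dot12 >= dot11:
        return 1.0  # g1 alone is already the minimum-norm point
    if dot12 >= dot22:
        return 0.0  # g2 alone is already the minimum-norm point
    # Interior solution: ((g2 - g1) . g2) / ||g1 - g2||^2, with a positive denominator here.
    return (dot22 - dot12) / (dot11 - 2.0 * dot12 + dot22)


def common_direction(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Minimum-norm convex combination of the two gradients."""
    w = min_norm_weight(g1, g2)
    return [w * a + (1.0 - w) * b for a, b in zip(g1, g2)]
```

Stepping against the returned direction does not increase either objective to first order, and a zero vector signals that no direction improves both, i.e. a Pareto-stationary point.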
To minimise the variance in the gradient estimators of the two objectives, we build two neural network critics that we use as baselines. The first attempts to model the value of the standard cost, while the second learns information gain values. For the CVaR’s gradient, we simply apply policy gradients. As CVaR is non-Markovian, it is difficult to estimate a separate critic for it. In our experiments, a heuristic using discounted safety losses as unbiased baselines was used and proved effective. In short, our main update equations when using a policy parameterised by a neural network with parameters $\theta$ can be written as:
where is a learning rate, and , are neural network critics with parameters and . We present the main steps in Algorithm 1 and more details in the appendix.
4 Experiments
We assess SAMBA in terms of both safe learning (train) and safe final policies (test) on three dynamical systems, two of which are adaptations of standard dynamical systems for MBRL (Safe Pendulum and Safe Cart-Pole Double Pendulum), while the third (Fetch Robot, where the task is to optimally control the end-effector to reach a 3D goal) we adapt from OpenAI’s robotics environments brockman2016openai. In each of these tasks, we define unsafe regions as areas in state spaces and design the safety loss (i.e., the constraint cost function of Section 3.1) to correspond to the (linearly proportional) distance between the end-effector’s position (when in the hazard region) and the centre of the unsafe region. SAMBA implements a more stable proximal update of Equation 2 following a similar method to schulman2017proximal. We compare against algorithms from both model-free and model-based literature. Model-free comparisons against TRPO schulman2015trust, PPO schulman2017proximal, CPO achiam2017constrained, STRPO (safety-constrained TRPO) raybenchmarking and SPPO (safety-constrained PPO) raybenchmarking enable us to determine if SAMBA improves upon the following: sample complexities during training; total violations (TV), that is, the total number of timesteps spent inside the unsafe region; and total accumulated safety cost (TC). (Footnote 9: Note, we report safe learning process TC, which is the total incurred safety cost throughout all training environment interactions, and safe evaluation TC and TV, which similarly are the total incurred safety cost during evaluation and the total violations from the timesteps spent inside the unsafe region during evaluation.) Comparison with model-based solvers (e.g., PlaNet hafner2018learning, (P)PILCO deisenroth2011pilco) sheds light on the importance of our active exploration metrics. It is important to note that when implementing PILCO, we preferred a flexible solution that does not assume moment-matching and specific radial-basis function controllers. Hence, we adapted PILCO to support proximal policy updates, referred to as PPILCO in our experiments, and similarly, SPPILCO (safety-constrained PPILCO), which also proved effective. As SAMBA introduces exploration components to standard model-based learners, we analysed these independently before combining them and reporting TV and TC (see Table 1). All policies are represented by two-hidden-layer (32 units each) neural networks with non-linearities. Each requires under 12 hours of training on an NVIDIA GeForce RTX 2080 Ti which has a power consumption of 225W, yielding an estimated training cost of £ per model, per seed. Due to space constraints, all hyper-parameters to reproduce our results can be found in the appendix.
(Semi-)metrics Component: To evaluate our (semi-)metrics, we conducted an analysis that reports a 2D projection view of the state space at various intervals in the data-collection process. We compare both of our (semi-)metrics against an entropy-based exploration metric and report the results on two systems in Figure 1. It is clear that both our (semi-)metrics encourage agents to explore safe regions in the state space, as opposed to entropy, which mostly grows in unsafe regions. Similar results are demonstrated with the Fetch Reach robot task (in the appendix). It is also worth noting that due to the high-dimensional nature of the tasks, visual analysis can only give indications. Still, empirical analysis supports our claim and performance improvements are clear; see Table 1.
Learning and Evaluation: Having concluded that our (semi-)metrics indeed provide informative signals to the agent for safe exploration, we then conducted learning and evaluation experiments comparing SAMBA against state-of-the-art methods. Results reported in Table 1 demonstrate that SAMBA reduces the amount of training TC and samples by orders of magnitude compared to others. Interestingly, during safe evaluation (deploying the learnt policy and evaluating performance), we see SAMBA’s safety performance competitive with (if not significantly better than) policies trained for safety in terms of TC and TV. Such results are interesting as SAMBA was never explicitly designed to specifically minimise test TV, but it was still able to acquire significant reductions. Of course, one might argue that these results do not convey a safe final policy, as violations are still non-zero. We recall, however, that we define safety in terms of CVaR constraints, which do not totally prohibit violations but rather limit the average cost of excess violations beyond a (user-defined) risk level. Indeed, as mentioned above, it is not possible to guarantee total safety without strong assumptions on dynamics and/or an initial safe policy (and of course, none of the algorithms in Table 1 have zero violations).
|Learner||Safe Pendulum||Safe Cart Pole Double Pendulum||Safe Fetch Reach|
|PlaNet w RS||100||51||3.5||4.2||40||1.3||1.6||2.5||-||-||-||-|
The data is scaled by . Note: PlaNet w RS on Safe Fetch Reach diverged during training.
To evaluate risk performance, we conducted an in-depth study evaluating all methods on the two most challenging environments, Safe Cart Pole Double Pendulum and Safe Fetch Reach, using cost limits and . Table 2 demonstrates that SAMBA achieves lower safety cost quartiles and lower expected safety cost. Therefore, we conclude that, indeed, SAMBA produces safe final policies in terms of its objective in Equation 1.
|Learner||Safe Cart Pole Double Pendulum||Safe Fetch Reach|
|PlaNet w RS||0.83||0.94||1.11||-24||-||-||-||-||-||-|
Note: PlaNet w RS on Safe Fetch Reach diverged during training.
5 Conclusion and future work
We proposed SAMBA, a safe and active model-based learner that makes use of GP models and solves a bi-objective constrained problem to trade off cost minimisation, exploration, and safety. We evaluated our method on three benchmarks, including ones from OpenAI’s safety gym, and demonstrated a significant reduction in training cost and sample complexity, as well as safe final policies in terms of CVaR constraints, compared to the state-of-the-art.
In the future, we plan to generalise to variational GPs damianou2011variational, and to apply our method in real-world robotics. Additionally, we want to study the theoretical guarantees of SAMBA to demonstrate convergence to well-behaving, exploratory, and safe policies.
Appendix A Algorithm
In this section we provide a more detailed description of Algorithm 1. We shall use the following short-hand for the advantage function: