LS3: Latent Space Safe Sets for Long-Horizon Visuomotor Control of Iterative Tasks

by Albert Wilcox, et al.
University of California, Berkeley

Reinforcement learning (RL) algorithms have shown impressive success in exploring high-dimensional environments to learn complex, long-horizon tasks, but can often exhibit unsafe behaviors and require extensive environment interaction when exploration is unconstrained. A promising strategy for safe learning in dynamically uncertain environments is requiring that the agent can robustly return to states where task success (and therefore safety) can be guaranteed. While this approach has been successful in low-dimensional settings, enforcing the constraint in environments with high-dimensional state spaces, such as images, is challenging. We present Latent Space Safe Sets (LS3), which extends this strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned Safe Set where task completion is likely. We evaluate LS3 on 4 domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use prior task successes to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints. See the project website for code and supplementary material.






1 Introduction

Visual planning over learned forward dynamics models is a popular area of research in robotic control from images [1, 2, 3, 4, 5, 6, 7], as it enables closed-loop, model-based control for tasks where the state of the system is not directly observable or is difficult to model analytically, such as the configuration of a sheet of fabric or a segment of cable. These methods learn predictive models over either images or a learned latent space, which are then used by model predictive control (MPC) to optimize image-based task costs. While these approaches have significant promise, several challenges remain open in learning policies from visual observations. First, reward specification is particularly challenging for visuomotor control tasks, because high-dimensional observations often do not expose the features required to design informative reward functions [8], especially for long-horizon tasks. Second, while many prior reinforcement learning methods have been successfully applied to image-based control tasks [9, 10, 11, 12, 13], learning policies from image observations often requires extensive exploration due to the high dimensionality of the observation space and the difficulties in reward specification, making safe and efficient learning exceedingly challenging.

Safe reinforcement learning is of critical importance in robotics, as unconstrained exploration can cause serious damage to the robot and its surroundings [14]. Safe learning also tends to be efficient learning, since it keeps the agent from exploring clearly suboptimal behaviors. There has been significant prior work on safe policy learning in low-dimensional observation spaces [15, 16, 17, 18, 19]. Thananjeyan et al. [18] present a safe and efficient algorithm for policy learning for tasks with a low-variance start state distribution and a fixed goal state by learning a Safe Set, which captures the set of states from which the agent has previously completed the task. Safe Sets are a common ingredient in classical control algorithms for guaranteeing policy improvement and constraint satisfaction [19, 17, 20], and restricting exploration to the neighborhood of this set can lead to highly efficient and safe learning for long-horizon tasks [18]. Extending these methods to variable start and goal sets in low-dimensional settings has been studied in Thananjeyan et al. [17]. However, scaling these approaches to high-dimensional image observations is challenging, since images do not directly expose details about the system state or dynamics that are typically needed for formal controller analysis [17, 19, 20]. In this work, we study learning in the iterative setting, where the start and goal sets have low variance, and focus on scaling these approaches to image-based inputs. The proposed algorithm makes several practical relaxations and maintains the same safety guarantees under the same additional assumptions as in Thananjeyan et al. [18, 17].

We introduce Latent Space Safe Sets (LS3), a model-based RL algorithm for visuomotor policy learning that provides safety by learning a continuous relaxation of a Safe Set in a learned latent space. This Latent Space Safe Set ensures that the agent can plan back to regions in which it is confident in task completion, even when learning in high-dimensional spaces. This constraint makes it possible to (1) improve safely, by ensuring that the agent can consistently complete the task (and therefore avoid unsafe behavior), and (2) learn efficiently, since the agent only explores promising states in the immediate neighborhood of those in which it was previously successful. LS3 additionally enforces user-specified state space constraints by estimating the probability of constraint violations over a learned, probabilistic latent space dynamics model. We contribute (1) Latent Space Safe Sets (LS3), a novel reinforcement learning algorithm for safely and efficiently learning long-horizon visual planning tasks, (2) simulation experiments on 3 visuomotor continuous control tasks suggesting that LS3 can learn to improve upon demonstrations more safely and efficiently than baselines, and (3) physical experiments on a vision-based cable routing task on the da Vinci surgical robot suggesting that LS3 can learn a policy more efficiently than prior algorithms while consistently completing the task and satisfying constraints during learning.

Figure 1: Latent Space Safe Sets (LS3): At time t, LS3 observes an image s_t of the environment. The image is first encoded to a latent vector z_t. Then, LS3 uses a sampling-based optimization procedure to optimize H-length action sequences by sampling H-length latent trajectories from the learned latent dynamics model. For each sampled trajectory, LS3 checks whether latent space obstacles are avoided and whether the terminal state of the trajectory falls in the Latent Space Safe Set. The terminal state constraint ensures the algorithm can plan back to regions of safety and task confidence while still enabling exploration. For feasible trajectories, the sum of rewards and the value of the terminal state are computed and used for ranking. LS3 executes the first action in the optimized plan and repeats this procedure at the next timestep.

2 Related Work

2.1 Safe, Iterative Learning Control

In iterative learning control (ILC), the agent tracks an initially provided reference trajectory and uses data from controller rollouts to iteratively refine tracking performance [21]. Rosolia et al. [22] and Rosolia and Borrelli [23, 19] present a class of algorithms known as Learning Model Predictive Control (LMPC), which are reference-free and instead iteratively improve upon the performance of an initial feasible trajectory. To achieve this, they present model predictive control algorithms that use data from controller rollouts to learn a Safe Set and value function, with which recursive feasibility, stability, and local optimality can be guaranteed given a known, deterministic nonlinear system or a stochastic linear system under certain regularity assumptions. However, a core limitation of these algorithms is that they assume known system dynamics and cannot easily be applied to high-dimensional control problems. Thananjeyan et al. [18] extend the LMPC framework to higher-dimensional settings in which system dynamics are unknown and must be estimated iteratively from experience, but the visuomotor control setting introduces a number of new challenges for iterative learning control algorithms, such as learning system dynamics, Safe Sets, and value functions that can flexibly and efficiently accommodate visual inputs.

2.2 Model Based Reinforcement Learning

There has been significant recent progress in algorithms which combine ideas from model-based planning and control with deep learning [24, 25, 26, 27, 28, 29]. These algorithms are gaining popularity in the robotics community as they enable learning complex policies from data while maintaining some of the sample efficiency and safety benefits of classical model-based control techniques. However, they typically require hand-engineered dense cost functions for task specification, which can be difficult to provide, especially in high-dimensional spaces. This motivates leveraging (possibly suboptimal) demonstrations to provide an initial signal regarding desirable agent behavior. There has been some prior work on leveraging demonstrations in model-based algorithms, such as Quinlan and Khatib [30] and Ichnowski et al. [31], which use model-based control with known dynamics to refine initially suboptimal motion plans, and Fu et al. [26], which uses demonstrations to seed a learned dynamics model for fast online adaptation using iLQR. Thananjeyan et al. [18] and Zhu et al. [32] present ILC algorithms which rapidly improve upon suboptimal demonstrations when system dynamics are unknown. However, these algorithms either require knowledge of system dynamics [30, 31] or are limited to low-dimensional state spaces [26, 18, 32] and cannot be flexibly applied to visuomotor control tasks.

2.3 Reinforcement Learning from Pixels

Reinforcement learning and model-based planning from visual observations are gaining significant recent interest, as RGB images provide an easily available observation space for robot learning [1, 33]. Recent work has proposed a number of model-free and model-based algorithms that have seen success in laboratory settings on a number of robotic tasks when learning from visual observations [34, 35, 10, 36, 12, 13, 1, 37, 33]. However, two core issues that prevent application of many RL algorithms in practice, inefficient exploration and safety, are significantly exacerbated when learning from high-dimensional visual observations, where the space of possible behaviors is very large and the features required to determine whether the robot is safe are not readily exposed. There has been significant prior work on addressing inefficiencies in exploration for visuomotor control, such as latent space planning [2, 33, 37] and goal-conditioned reinforcement learning [13, 10]. However, safe reinforcement learning for visuomotor tasks has received substantially less attention. Thananjeyan et al. [14] and Kahn et al. [38] present reinforcement learning algorithms which estimate the likelihood of constraint violations to avoid them [14] or reduce the robot's velocity [38]. Unlike these algorithms, which focus on methods for avoiding violations of user-specified constraints, LS3 additionally provides consistent task completion during learning by limiting exploration to the neighborhood of prior task successes. This makes LS3 less susceptible to the challenges of unconstrained exploration present in standard model-free reinforcement learning algorithms.

3 Problem Statement

We consider an agent interacting in a finite-horizon, goal-conditioned Markov Decision Process (MDP), described by the tuple M = (S, A, P, R, μ, T). S and A are the state and action spaces, P maps a state and action to a probability distribution over subsequent states, R is the reward function, μ is the initial state distribution (s_0 ~ μ), and T is the time horizon. In this work, the agent is only provided with RGB image observations s ∈ R^{W×H×3}, where W and H are the image width and height in pixels, respectively. We consider tasks in the iterative learning control setting, where the agent must reach a goal set G ⊆ S as efficiently as possible and the support of μ is small. While there are a number of possible reward functions that would encourage fast convergence to G, providing shaped reward functions can be exceedingly challenging, especially when learning complex tasks in which agents are only provided with high-dimensional observations. Thus, as in Thananjeyan et al. [18], we consider a sparse reward function that only provides a signal upon task completion: R(s) = 0 if s ∈ G and R(s) = -1 otherwise. To incorporate constraints, we augment M with an extra constraint indicator function C : S → {0, 1}, which indicates whether a state satisfies user-specified state-space constraints, such as avoiding known obstacles. This is consistent with the modified CMDP formulation used in [14]. We assume that R and C can be evaluated on the current state of the system but cannot be used for planning. We make this assumption because in practice we plan over predicted future states, which may not be predicted at sufficiently high fidelity to expose the information needed to evaluate R and C directly during planning.

Given a policy π, its expected total return in M is R^π = E_π[ Σ_{t=0}^{T-1} R(s_t) ]. Furthermore, we define P^π_C(s) as the probability of future constraint violation (within time horizon T) under policy π from state s. The objective is then to maximize expected return while keeping the constraint-violation probability below some δ_C ∈ [0, 1]. This can be written formally as follows:

    π* ∈ argmax_π  E_π[ Σ_{t=0}^{T-1} R(s_t) ]   s.t.   E_{s_0 ~ μ}[ P^π_C(s_0) ] ≤ δ_C.   (1)
We assume that the agent is provided with an offline dataset D of transitions in the environment, of which some subset are constraint violating and some subset appear in successful demonstrations from a suboptimal supervisor. As in [14], D contains examples of constraint-violating behaviors (for example, from prior runs of different policies or collected under human supervision) so that the agent can learn about states which violate user-specified constraints.
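The sparse reward and the chance constraint above can be made concrete with a short sketch. Below is a minimal, self-contained illustration (not the authors' implementation): `sparse_reward` encodes R(s) = 0 on the goal set and -1 elsewhere, and `violation_probability` is a Monte Carlo estimate of P^π_C(s_0) under a toy, hypothetical environment `step` function.

```python
import random

def sparse_reward(state, goal_set):
    """R(s) = 0 inside the goal set G and -1 otherwise (Section 3)."""
    return 0.0 if state in goal_set else -1.0

def violation_probability(policy, step, violates, s0, horizon, n_rollouts=500, seed=0):
    """Monte Carlo estimate of P_C^pi(s0): the fraction of rollouts of `policy`
    from s0 that violate the constraint within `horizon` steps."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_rollouts):
        s = s0
        for _ in range(horizon):
            s = step(s, policy(s), rng)
            if violates(s):
                violations += 1
                break
    return violations / n_rollouts

# Toy 1-D random walk (hypothetical): the state is unsafe at -3 or below.
step = lambda s, a, rng: s + a + rng.choice([-1, 0, 1])
p = violation_probability(lambda s: 0, step, lambda s: s <= -3, s0=0, horizon=10)
print(0.0 <= p <= 1.0)  # True
```

A policy is feasible for objective (1) when this estimated probability, averaged over start states, stays below δ_C.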

4 Latent Space Safe Sets (LS3)

Here we describe how LS3 uses demonstrations and online environment interaction to safely learn iteratively improving policies. Section 4.1 describes how we learn a low-dimensional latent representation of image observations to facilitate efficient model-based planning. To enable this planning, we learn a probabilistic forward dynamics model in the learned latent space, as in [28], along with models to estimate whether plans will likely complete the task (Section 4.2) and to estimate future rewards and constraint violations (Section 4.3) from predicted trajectories. Finally, in Section 4.4, we discuss how all of these components are combined in LS3 to enable safe and efficient policy improvement. The dataset D is expanded using online rollouts of LS3 and used to update all latent space models (Sections 4.2 and 4.3) after every batch of rollouts. See Algorithm 1 and the supplement for further details on training procedures and data collection for all components.

Figure 2: LS3 Learned Models: LS3 learns a low-dimensional latent representation of image observations (Section 4.1) and, in this learned latent space, learns a dynamics model, value function, reward function, constraint classifier, and Safe Set for constrained planning and task-completion-driven exploration. These models are then used for model-based planning to maximize the total value of predicted latent states (Section 4.3) while enforcing the Safe Set (Section 4.2) and user-specified constraints (Section 4.3).
Algorithm 1 Latent Space Safe Sets (LS3)
Require: offline dataset D, number of training iterations
1: Train VAE encoder and decoder (Section 4.1) using data from D
2: Train the dynamics model, Safe Set classifier (Section 4.2), value function, goal indicator, and constraint estimator (Section 4.3) using data from D
3: for each training iteration do
4:     for each rollout in the batch do
5:         Sample starting state s_0 from μ
6:         for t = 1, ..., T do
7:             Choose and execute a_t (Section 4.4)
8:             Observe s_{t+1}, reward r_t, and constraint indicator c_t
9:             Add the transition (s_t, a_t, s_{t+1}, r_t, c_t) to D
10:    Update the encoder, decoder, dynamics model, Safe Set classifier, value function, goal indicator, and constraint estimator with data from D

4.1 Learning a Latent Space for Planning

Learning compressed representations of images has been a popular approach in vision-based control to facilitate efficient algorithms for planning and control which can reason about lower-dimensional inputs [2, 37, 6, 39, 40, 33]. To learn such a representation, we train a β-variational autoencoder [41] on states in D to map states to a probability distribution over a low-dimensional latent space Z. The resulting encoder network is then used to sample latent vectors to train a forward dynamics model, value function, reward estimator, constraint classifier, and Safe Set, and these elements are combined to define a policy for model-based planning. Motivated by Laskin et al. [42], during training we augment inputs to the encoder with random cropping, which we found helpful for learning representations that are useful for planning. For all environments we use the same latent dimension as in [2] and found that varying it did not significantly affect performance.

4.2 Latent Safe Sets for Model-Based Control

LS3 learns a binary classifier on latent states to represent a latent space Safe Set: the set of states from which the agent has high confidence in task completion based on prior experience. Because the agent can reach the goal from these states, they are safe: the agent can avoid constraint violations by simply completing the task as it has before. While classical algorithms use known dynamics to construct Safe Sets, we approximate this set using successful trajectories from prior iterations. At each iteration, the algorithm collects a batch of trajectories in the environment. Let S_i^j denote the set of states visited in trajectory j of iteration i, and let U_i denote the set of index pairs (i', j) with i' ≤ i such that trajectory j of iteration i' successfully reached G. We define the sampled Safe Set at iteration i as SS_i = ∪_{(i', j) ∈ U_i} S_{i'}^j. In short, this is the set of states from which the agent has successfully navigated to G by iteration i of training.

This discrete set is difficult to plan to with continuous-valued state distributions, so we leverage data inside the sampled Safe Set, data outside it, and the learned encoder from Section 4.1 to learn a continuous relaxation of this set in latent space (the Latent Safe Set). We train a neural network f_S with a binary cross-entropy loss to predict the probability that a state with a given encoding lies in the sampled Safe Set. To mitigate the negative bias that appears when trajectories that start in safe regions fail, we utilize the intuition that if a state's successor is in the sampled Safe Set, then the state itself is likely safe as well; accordingly, f_S is trained to predict whether a state or one of the states shortly following it in its trajectory lies in the sampled Safe Set. The relaxed Latent Safe Set is parameterized by the superlevel sets of f_S, {z : f_S(z) ≥ δ_SS}, where the level δ_SS is adaptively set during execution.
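The two constructions in this subsection, the discrete sampled Safe Set and its continuous superlevel-set relaxation, can be sketched as follows. This is an illustrative toy, not the paper's code: the "classifier" here is a hypothetical scoring function standing in for the learned network f_S.

```python
import numpy as np

def sampled_safe_set(trajectories):
    """Discrete Safe Set: the union of states from all successful trajectories.
    Each trajectory is a pair (list_of_states, reached_goal)."""
    safe = set()
    for states, reached_goal in trajectories:
        if reached_goal:
            safe.update(states)
    return safe

def relaxed_safe_set(latents, classifier, level):
    """Continuous relaxation in latent space: the superlevel set
    {z : f_S(z) >= level} of the learned classifier."""
    return latents[classifier(latents) >= level]

trajs = [([(0, 0), (1, 0), (2, 0)], True),   # reached the goal
         ([(5, 5), (6, 6)], False)]          # failed: excluded from the set
print(sorted(sampled_safe_set(trajs)))       # [(0, 0), (1, 0), (2, 0)]

# Hypothetical classifier whose score decays with distance from the origin.
z = np.array([[0.1], [0.5], [2.0]])
f_s = lambda zs: np.exp(-np.abs(zs[:, 0]))
print(len(relaxed_safe_set(z, f_s, level=0.5)))  # 2
```

Raising the level shrinks the relaxed set toward high-confidence states; lowering it admits more exploratory states, which is why the level is set adaptively at execution time.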

4.3 Reward and Constraint Estimation

In this work, we define rewards based on whether the agent has reached a state in G, but we need rewards that are defined on predictions from the dynamics model, which may not correspond to valid real images. To address this, we train a classifier to map the encoding of a state to the probability that the state is contained in G, using terminal states of successful trajectories (which are known to be in G) and other states in D. However, in the temporally extended, sparse-reward tasks we consider, reward prediction alone is insufficient, because rewards only indicate whether the agent is in the goal set and thus provide no signal on task progress unless the agent can plan all the way to the goal set. To address this, as in prior MPC literature [18, 17, 19, 8], we train a recursively defined value function (details in the supplement). Similarly, we use the encoder (Section 4.1) to train a classifier with constraint-violating states from D and constraint-satisfying states in D to map the encoding of a state to the probability of constraint violation.
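The "recursively defined" value function above is trained on bootstrapped targets. A minimal sketch of such targets under the sparse reward of Section 3 (illustrative only; the paper's exact target construction is in its supplement):

```python
import numpy as np

def value_targets(rewards, next_values, dones, gamma=0.99):
    """TD(0) targets y_t = r_t + gamma * V(z_{t+1}); terminal states bootstrap 0.
    With the sparse reward of Section 3 (0 at the goal, -1 otherwise), the fitted
    V(z) approximates the negative discounted time-to-goal from z."""
    return rewards + gamma * next_values * (1.0 - dones)

r = np.array([-1.0, -1.0, 0.0])       # the trajectory reaches the goal last step
v_next = np.array([-1.0, 0.0, 0.0])   # current value estimates of successors
done = np.array([0.0, 0.0, 1.0])
print(value_targets(r, v_next, done))  # approx. [-1.99, -1.0, 0.0]
```

Regressing the value network toward these targets and recomputing them with the updated network is what propagates the goal signal backward through long horizons.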

4.4 Model-Based Planning with LS

LS3 aims to maximize the total reward attained in the environment while keeping the probability of constraint violation below a threshold (equation 1). We optimize an approximation of this objective over an H-step receding horizon with model predictive control. Precisely, LS3 solves the following optimization problem to generate an action to execute at timestep t:

    a*_{t:t+H-1} ∈ argmax_{a_{t:t+H-1}}  E[ Σ_{i=0}^{H-1} R(z_{t+i}) + V(z_{t+H}) ]   (2)
    s.t.  z_t = f_enc(s_t)                                                            (3)
          z_{t+i+1} ~ f_dyn(z_{t+i}, a_{t+i}),  i ∈ {0, ..., H-1}                      (4)
          P(z_{t+H} ∈ Latent Safe Set) ≥ 1 - δ_S                                      (5)
          P(z_{t+i} ∈ Z_C for some i) ≤ δ_C                                           (6)

In this problem, the expectations and probabilities are taken with respect to the learned probabilistic dynamics model f_dyn. The optimization problem is solved approximately using the cross-entropy method (CEM) [43], a popular optimizer in model-based RL [44, 18, 17, 45, 14].

The objective function is the expected sum of future rewards if the agent executes the candidate action sequence and then continues onward from the terminal state, whose value is estimated with the learned value function (equation 2). First, the current state s_t is encoded to z_t (equation 3). Then, for a candidate sequence of actions a_{t:t+H-1}, an H-step latent trajectory is sampled from the learned dynamics model (equation 4). LS3 constrains exploration using two chance constraints: (1) the terminal latent state of the plan must fall in the Latent Safe Set with high probability (equation 5), and (2) all latent states in the trajectory must satisfy user-specified state-space constraints with high probability (equation 6), where Z_C is the set of latent states whose corresponding observations are constraint violating. The optimizer estimates constraint-satisfaction probabilities for a candidate action sequence by simulating it repeatedly over the stochastic dynamics model. The first chance constraint ensures the agent maintains the ability to return within H steps, if necessary, to safe states from which it knows how to complete the task. Because the agent replans at each timestep, it need not actually return to the Safe Set: during training, the Safe Set expands, enabling further exploration. In practice, we set the level δ_S for the Safe Set classifier adaptively, as described in the supplement. The second chance constraint keeps the probability of constraint violation below δ_C. After solving the optimization problem, the agent executes the first action of the plan, a*_t, observes a new state, and replans.
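The planning loop of equations (2)-(6) can be sketched end to end. The following is a simplified, 1-D-action toy (not the released implementation): CEM samples action sequences, each is rolled out through a stochastic dynamics stand-in several times to estimate the chance constraints, infeasible plans are discarded, and feasible plans are ranked by mean return plus terminal value. All of the callables passed in here are hypothetical stand-ins for the learned models.

```python
import numpy as np

def cem_plan(z0, dynamics, reward, value, in_safe_set, violates, horizon=5,
             pop=200, elites=20, iters=5, n_particles=10, delta=0.2, seed=0):
    """Sketch of the LS3 planning loop: CEM over H-step action sequences with
    Monte Carlo chance constraints on Safe Set membership and obstacles."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        actions = rng.normal(mean, std, size=(pop, horizon))
        scores = np.full(pop, -np.inf)
        for k in range(pop):
            total, safe_hits, clean = 0.0, 0, 0
            for _ in range(n_particles):      # particles estimate the chance constraints
                z, ok, ret = z0, True, 0.0
                for a in actions[k]:
                    z = dynamics(z, a, rng)
                    ok = ok and not violates(z)
                    ret += reward(z)
                safe_hits += in_safe_set(z)   # terminal-state Safe Set check (eq. 5)
                clean += ok                   # obstacle avoidance check (eq. 6)
                total += ret + value(z)
            if safe_hits >= (1 - delta) * n_particles and clean >= (1 - delta) * n_particles:
                scores[k] = total / n_particles
        elite = actions[np.argsort(scores)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action, then replan

# Toy 1-D system: drive z toward 3 while never dipping below -2.
dyn = lambda z, a, rng: z + a + 0.01 * rng.normal()
a0 = cem_plan(0.0, dyn, reward=lambda z: 0.0, value=lambda z: -abs(z - 3),
              in_safe_set=lambda z: z < 8.0, violates=lambda z: z < -2.0)
print(np.isfinite(a0))  # True
```

Only the first optimized action is executed before re-encoding and replanning, which is what lets the effective Safe Set constraint relax as the set grows.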

Figure 3: Experimental Domains: LS3 is evaluated on 3 long-horizon, image-based simulation environments: a visual navigation domain where the goal is to navigate the blue point mass to the right goal set while avoiding the red obstacle; a 2-degree-of-freedom reacher arm where the task is to navigate around a red obstacle to reach the yellow goal set; and a sequential pushing task where the robot must push each of 3 blocks forward by a target displacement, from left to right. We also evaluate LS3 on a physical cable-routing task on a da Vinci surgical robot, where the goal is to guide a red cable to a green target without the cable or robot arm colliding with the blue obstacle. This requires learning visual dynamics, because the agent must model how the rest of the cable will deform during manipulation to avoid collisions with the obstacle.

5 Experiments

We evaluate LS3 on 3 robotic control tasks in simulation and a physical cable routing task on the da Vinci Research Kit (dVRK) [46]. Safe RL is of particular interest for surgical robots such as the dVRK: their delicate structure motivates safety, and their relatively imprecise controls motivate closed-loop control [18, 47]. We study whether LS3 can learn more safely and efficiently than algorithms that do not structure exploration based on prior task successes.

5.1 Comparisons

We compare LS3 to prior algorithms that behavior clone suboptimal demonstrations before exploring online (SACfD) [48], or that leverage offline reinforcement learning to initialize a policy from all offline data before updating it online (AWAC) [49]. For both of these comparisons we enforce constraints via a tuned reward penalty for constraint violations, as in [16]. We also implement a version of SACfD with a learned recovery policy (SACfD+RRL) using the Recovery RL algorithm [14], which uses prior constraint-violating data to try to avoid constraint-violating states. Finally, we compare LS3 to an ablated version without the Safe Set constraint in equation 5, denoted LS3 (no Safe Set), to evaluate whether the Safe Set promotes consistent task completion and stable learning. See the supplement for details on hyperparameters and offline data used for LS3 and the prior algorithms.

Figure 4: Simulation Experiment Results: Learning curves showing mean and standard error over 10 random seeds. LS3 consistently learns more quickly than the baselines, as well as the ablated algorithm without the Safe Set. Although SACfD and SACfD+RRL eventually achieve similar reward values, LS3 is much more sample efficient and stable across random seeds.

5.2 Evaluation Metrics

For each algorithm on each domain, we aggregate statistics over random seeds (10 for simulation experiments, 3 for the physical experiment), reporting the mean and standard error across seeds. We present learning curves that show the total reward for each training trajectory to study how efficiently LS3 and the comparisons learn each task. Because all tasks use the sparse, task-completion-based rewards defined in Section 3, the total reward for a trajectory is the negative of the time taken to reach the goal set, so more negative rewards correspond to slower convergence to G. Thus, for a task with horizon T, a total reward greater than -T implies successful task completion. The state is frozen in place upon constraint violation until the task horizon elapses. We report task success and constraint satisfaction rates for LS3 and the comparisons to study whether the algorithms consistently complete the task during learning and satisfy user-specified state-space constraints. LS3 collects a batch of trajectories between training phases on both the simulated and physical tasks, while the SACfD and AWAC comparisons update their parameters after each timestep; this yields a calibrated comparison in terms of the amount of data collected across algorithms.
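The reward-to-success conversion described above is mechanical and can be stated in a few lines (an illustrative helper, not part of the paper's code):

```python
def trajectory_stats(rewards, horizon):
    """Total reward and success flag for one training trajectory under the
    sparse reward of Section 3: -1 per step until the goal is reached, so a
    total reward greater than -horizon implies the goal was reached in time."""
    total = sum(rewards)
    return total, total > -horizon

print(trajectory_stats([-1, -1, -1, 0], horizon=10))  # (-3, True)
print(trajectory_stats([-1] * 10, horizon=10))        # (-10, False)
```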

5.3 Domains

In simulation, we evaluate LS3 on 3 vision-based continuous control domains, illustrated in Figure 3. We evaluate LS3 and the comparisons on a constrained visual navigation task (Pointmass Navigation), where the agent navigates from a fixed start state to a fixed goal set while avoiding a large central obstacle. We study this domain to gain intuition and to visualize the learned value function, goal/constraint indicators, and Safe Set in Figure 2. We then study a constrained image-based reaching task (Reacher) based on [50], where the objective is to navigate the end effector of a 2-link planar robotic arm to a yellow goal position without the end effector entering a red stay-out zone. Next, we study a challenging sequential image-based robotic pushing domain (Sequential Pushing), in which the objective is to push each of 3 blocks forward on the table without pushing them to either side, which would cause them to fall out of the workspace. Finally, we evaluate LS3 on an image-based physical experiment on the da Vinci Research Kit (dVRK) [51] (Figure 3), where the objective is to guide the endpoint of a cable to a goal region without letting the cable or end effector collide with an obstacle. The Pointmass Navigation and Reaching domains have shorter task horizons than the Sequential Pushing domain and the physical experiment; see the supplement for the exact horizons and more details on all domains.

5.4 Simulation Results

We find that LS3 learns more stably and efficiently than all comparisons across all simulated domains while converging to similar performance within 250 trajectories collected online (Figure 4). LS3 consistently completes the task during learning, while the comparisons, which do not learn a Safe Set to structure exploration based on prior successes, exhibit much less stable learning. Additionally, in Tables 1 and 2 we report the task success rate and constraint violation rate of all algorithms during training. LS3 achieves a significantly higher task success rate than the comparisons on all tasks. LS3 also violates constraints less often than the comparisons on the Reacher task, but violates them more often than SACfD and SACfD+RRL on the other domains, likely because those methods' much lower task success rates keep them out of the neighborhood of constraint-violating states during training. The AWAC comparison achieves very low task performance; while AWAC is designed for offline reinforcement learning, to the best of our knowledge it has not previously been evaluated on long-horizon, image-based tasks like those considered here, which we hypothesize are very challenging for it.

As expected, LS3 has a lower success rate when the Safe Set constraint is removed (LS3 (no Safe Set)). The Safe Set is particularly important in the sequential pushing task, where LS3 (no Safe Set) has a much lower task completion rate than LS3. See the supplement for details on experimental parameters and offline data used for LS3 and the comparisons, and for ablations studying the effect of the planning horizon and the threshold used to define the Safe Set.

Table 1: Task Success Rate (Pointmass Navigation, Reacher, Sequential Pushing): We present the mean and standard error of the training-time task completion rate over 10 random seeds. LS3 outperforms all comparisons across all 3 domains, with the gap increasing for the challenging sequential pushing task.

Table 2: Constraint Violation Rate (Pointmass Navigation, Reacher, Sequential Pushing): We report the mean and standard error of the training-time constraint violation rate over 10 random seeds. LS3 violates constraints less than comparisons on the Reacher task, but SACfD and SACfD+RRL achieve lower constraint violation rates on the Navigation and Pushing tasks, likely because their much lower task success rates keep them out of the neighborhood of constraint-violating regions.

5.5 Physical Results

In physical experiments, we compare LS3 to SACfD and SACfD+RRL (Figure 5) on the physical cable routing task illustrated in Figure 3. LS3 quickly outperforms the suboptimal demonstrations while succeeding at the task significantly more often than both comparisons, which are unable to learn the task and also violate constraints more often than LS3. We hypothesize that the difficulty of reasoning about cable collisions and deformation from images makes it challenging for the prior algorithms to make sufficient task progress, as they do not use prior successes to structure exploration. See the supplement for details on experimental parameters and offline data used for LS3 and the comparisons. Because physical data collection is expensive, we use data augmentation to expand the dataset used to train the goal and constraint classifiers, adding randomly sampled affine translations and perspective shifts to the images in the dataset.
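The augmentation idea above can be sketched with the simplest affine case, a random zero-padded translation (perspective shifts would need a full image-warping library, so this sketch covers only the translation component and is not the paper's augmentation pipeline):

```python
import numpy as np

def random_translate(img, max_shift, rng):
    """Shift an HxWxC image by a random (dy, dx), padding exposed pixels with 0:
    a simple stand-in for the affine-translation augmentation described above."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

rng = np.random.default_rng(0)
img = np.ones((32, 32, 3))
augmented = [random_translate(img, 4, rng) for _ in range(8)]  # expanded dataset
print(augmented[0].shape)  # (32, 32, 3)
```

Applying several such random transforms per real image multiplies the effective dataset size, which is what makes classifier training viable from a small number of physical rollouts.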

Figure 5: Physical Cable Routing Results: We present learning curves, task success rates, and constraint violation rates with mean and standard error across 3 random seeds. LS3 quickly learns to complete the task more efficiently than the demonstrator while violating constraints less often than the comparisons, which are unable to learn the task.

6 Discussion and Future Work

We present LS3, a scalable algorithm for safe and efficient policy learning for visuomotor tasks. LS3 structures exploration by learning a safe set in a learned latent space, which captures the set of states from which the agent is confident in task completion. LS3 then ensures that the agent can plan back to states in the safe set, encouraging consistent task completion during learning. Experiments suggest that LS3 is able to use this procedure to safely and efficiently learn 4 visuomotor control tasks, including a challenging sequential pushing task in simulation and a cable routing task on a physical robot. In future work, we are excited to explore further physical evaluation of LS3 on safety-critical visuomotor control tasks such as navigation for home support robots or surgical automation.


  • Ebert et al. [2018] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.
  • Hafner et al. [2019] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. Proc. Int. Conf. on Machine Learning, 2019.
  • Hoque et al. [2020] R. Hoque, D. Seita, A. Balakrishna, A. Ganapathi, A. K. Tanwani, N. Jamali, K. Yamane, S. Iba, and K. Goldberg. Visuospatial foresight for multi-step, multi-task fabric manipulation. Proc. Robotics: Science and Systems (RSS), 2020.
  • Lenz et al. [2015] I. Lenz, R. A. Knepper, and A. Saxena. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems. Rome, Italy, 2015.
  • Nair and Finn [2019] S. Nair and C. Finn. Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. Proc. Int. Conf. on Learning Representations, 2019.
  • Nair et al. [2020] S. Nair, S. Savarese, and C. Finn. Goal-aware prediction: Learning to model what matters. In Proceedings of the 37th International Conference on Machine Learning, pages 7207–7219, 2020.
  • Pertsch et al. [2020] K. Pertsch, O. Rybkin, F. Ebert, C. Finn, D. Jayaraman, and S. Levine. Long-horizon visual planning with goal-conditioned hierarchical predictors. Proc. Advances in Neural Information Processing Systems, 2020.
  • Tian et al. [2021] S. Tian, S. Nair, F. Ebert, S. Dasari, B. Eysenbach, C. Finn, and S. Levine. Model-based visual planning with self-supervised functional distances. Proc. Int. Conf. on Learning Representations, 2021.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications, 2018.
  • Nair et al. [2018] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine. Visual reinforcement learning with imagined goals. Proc. Advances in Neural Information Processing Systems, 2018.
  • Levine et al. [2016] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 2016.
  • Kalashnikov et al. [2018] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. Conf. on Robot Learning (CoRL), 2018.
  • Pong et al. [2020] V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine. Skew-fit: State-covering self-supervised reinforcement learning. Proc. Int. Conf. on Machine Learning, 2020.
  • Thananjeyan et al. [2020] B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery rl: Safe reinforcement learning with learned recovery zones. NeurIPS Deep Reinforcement Learning Workshop, 2020.
  • Achiam et al. [2017] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In Journal of Machine Learning Research, 2017.
  • Tessler et al. [2019] C. Tessler, D. J. Mankowitz, and S. Mannor. Reward constrained policy optimization. In Proc. Int. Conf. on Learning Representations, 2019.
  • Thananjeyan et al. [2020a] B. Thananjeyan, A. Balakrishna, U. Rosolia, J. E. Gonzalez, A. Ames, and K. Goldberg. Abc-lmpc: Safe sample-based learning mpc for stochastic nonlinear dynamical systems with adjustable boundary conditions, 2020a.
  • Thananjeyan et al. [2020b] B. Thananjeyan, A. Balakrishna, U. Rosolia, F. Li, R. McAllister, J. E. Gonzalez, S. Levine, F. Borrelli, and K. Goldberg. Safety augmented value estimation from demonstrations (saved): Safe deep model-based rl for sparse cost robotic tasks. IEEE Robotics and Automation Letters, 5(2):3612–3619, 2020b.
  • Rosolia and Borrelli [2018] U. Rosolia and F. Borrelli. Learning model predictive control for iterative tasks. a data-driven control framework. IEEE Transactions on Automatic Control, 2018.
  • Bansal et al. [2017] S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin. Hamilton-jacobi reachability: A brief overview and recent advances. In Conference on Decision and Control (CDC), 2017.
  • Bristow et al. [2006] D. A. Bristow, M. Tharayil, and A. G. Alleyne. A survey of iterative learning control. IEEE control systems magazine, 2006.
  • Rosolia et al. [2018] U. Rosolia, X. Zhang, and F. Borrelli. A Stochastic MPC Approach with Application to Iterative Learning. 2018 IEEE Conference on Decision and Control (CDC), 2018.
  • Rosolia and Borrelli [2019] U. Rosolia and F. Borrelli. Sample-based learning model predictive control for linear uncertain systems. CoRR, abs/1904.06432, 2019.
  • Deisenroth and Rasmussen [2011] M. Deisenroth and C. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proc. Int. Conf. on Machine Learning, 2011.
  • Lenz et al. [2015] I. Lenz, R. A. Knepper, and A. Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.
  • Fu et al. [2016] J. Fu, S. Levine, and P. Abbeel. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2016.
  • Lowrey et al. [2019] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. In Proc. Int. Conf. on Machine Learning, 2019.
  • Chua et al. [2018] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Proc. Advances in Neural Information Processing Systems, 2018.
  • Nagabandi et al. [2018] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2018.
  • Quinlan and Khatib [1993] S. Quinlan and O. Khatib. Elastic bands: connecting path planning and control. In International Conference on Robotics and Automation, pages 802–807 vol.2, 1993.
  • Ichnowski et al. [2020] J. Ichnowski, Y. Avigal, V. Satish, and K. Goldberg. Deep learning can accelerate grasp-optimized motion planning. Science Robotics, 5(48), 2020.
  • Zhu et al. [2021] Z. Zhu, N. Pivaroa, S. Gupta, A. Gupta, and M. Canova. 2021.
  • Lippi et al. [2020] M. Lippi, P. Poklukar, M. C. Welle, A. Varava, H. Yin, A. Marino, and D. Kragic. Latent space roadmap for visual action planning of deformable and rigid object manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020.
  • Rusu et al. [2017] A. A. Rusu, M. Večerík, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell. Sim-to-real robot learning from pixels with progressive nets. In Conference on Robot Learning, pages 262–270. PMLR, 2017.
  • Schoettler et al. [2020] G. Schoettler, A. Nair, J. Luo, S. Bahl, J. A. Ojea, E. Solowjow, and S. Levine. Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2020.
  • Singh et al. [2019] A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine. End-to-end robotic reinforcement learning without reward engineering. Proc. Robotics: Science and Systems (RSS), 2019.
  • Zhang et al. [2019] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine. Solar: Deep structured representations for model-based reinforcement learning. In International Conference on Machine Learning, pages 7444–7453. PMLR, 2019.
  • Kahn et al. [2017] G. Kahn, A. Villaflor, V. Pong, P. Abbeel, and S. Levine. Uncertainty-aware reinforcement learning for collision avoidance. CoRR, 2017.
  • Srinivas et al. [2018] A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. Proc. Int. Conf. on Machine Learning, 04 2018.
  • Ichter and Pavone [2019] B. Ichter and M. Pavone. Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters, 4(3):2407–2414, 2019. doi: 10.1109/LRA.2019.2901898.
  • Higgins et al. [2017] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. Proc. Int. Conf. on Learning Representations, 2017.
  • Laskin et al. [2020] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. 2020. arXiv:2004.14990.
  • Rubinstein [1999] R. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1(2):127–190, 1999.
  • Zhang et al. [2020] J. Zhang, B. Cheung, C. Finn, S. Levine, and D. Jayaraman. Cautious adaptation for reinforcement learning in safety-critical settings. In International Conference on Machine Learning, pages 11055–11065. PMLR, 2020.
  • Kazanzides et al. [2014] P. Kazanzides, Z. Chen, A. Deguet, G. S. Fischer, R. H. Taylor, and S. P. DiMaio. An open-source research kit for the da Vinci surgical system. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2014.
  • Seita et al. [2018] D. Seita, S. Krishnan, R. Fox, S. McKinley, J. Canny, and K. Goldberg. Fast and reliable autonomous surgical debridement with cable-driven robots using a two-phase calibration procedure. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2018.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proc. Int. Conf. on Machine Learning, 2018.
  • Nair et al. [2021] A. Nair, A. Gupta, M. Dalal, and S. Levine. Awac: Accelerating online reinforcement learning with offline datasets, 2021.
  • Tassa et al. [2020] Y. Tassa, S. Tunyasuvunakool, A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. Lillicrap, and N. Heess. dm-control: Software and tasks for continuous control, 2020.
  • Chua [2018] K. Chua. Experiment code for "Deep reinforcement learning in a handful of trials using probabilistic dynamics models", 2018.
  • Pong [2018] V. Pong. rlkit, 2018.
  • Thananjeyan and Balakrishna [2021] B. Thananjeyan and A. Balakrishna. Code for Recovery RL, 2021.
  • Sikchi [2021] H. Sikchi. Code for Advantage Weighted Actor Critic, 2021.
  • Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

7 Appendix

In Appendices 7.1 and 7.2 we discuss algorithmic details and implementation/hyperparameter details, respectively, for LS3 and all comparisons. We then provide full details regarding each of the experimental domains and how data is collected in them in Appendix 7.3. Finally, in Appendix 7.4 we present sensitivity experiments and ablations.

7.1 Algorithm Details

In this section, we provide implementation details and additional background information for LS3 and the comparison algorithms.

7.1.1 Latent Space Safe Sets (LS3)

We now discuss additional details for each of the components of LS3, including network architectures, training data, and loss functions.

Variational Autoencoders:

We scale all image inputs to a fixed size before feeding them to the β-VAE, which uses a convolutional neural network for the encoder $f_{\text{enc}}$ and a transpose convolutional neural network for the decoder $f_{\text{dec}}$. We use the encoder and decoder from Hafner et al. [2], but modify the second convolutional layer in the encoder to have a stride of 3 rather than 2. As is standard for β-VAEs, we train with a mean-squared error loss combined with a KL-divergence loss. For a particular observation $s$, the loss is

$$\mathcal{L}(s) = \left\| s - f_{\text{dec}}(z) \right\|_2^2 + \beta \, D_{\mathrm{KL}}\!\left( f_{\text{enc}}(z \mid s) \,\middle\|\, \mathcal{N}(0, I) \right),$$

where $z \sim f_{\text{enc}}(\cdot \mid s)$ is modeled using the reparameterization trick.
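This objective can be sketched as follows; a minimal numpy illustration of the two loss terms, assuming the encoder outputs a mean and log-variance and `decode` stands in for the transpose-convolutional decoder (an autodiff framework would be used in practice so gradients flow through the reparameterized sample):

```python
import numpy as np

def beta_vae_loss(s, mu, log_var, decode, beta=1.0, rng=None):
    """MSE reconstruction plus a beta-weighted KL(q(z|s) || N(0, I)) term."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps        # reparameterization trick
    recon = np.mean((s - decode(z)) ** 2)       # mean-squared error term
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl
```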

Probabilistic Dynamics:

As in Chua et al. [44], we train a probabilistic ensemble of neural networks to learn dynamics. Each network has two hidden layers with 128 hidden units and outputs the mean and diagonal covariance of a Gaussian distribution over the next latent state. We train these networks with a maximum log-likelihood objective, so for two consecutive latent states $z_t, z_{t+1}$ and the corresponding action $a_t$, the loss for a dynamics model with parameters $\theta$ is

$$\mathcal{L}(\theta) = -\log p_\theta(z_{t+1} \mid z_t, a_t) \propto \left( \mu_\theta(z_t, a_t) - z_{t+1} \right)^\top \Sigma_\theta^{-1}(z_t, a_t) \left( \mu_\theta(z_t, a_t) - z_{t+1} \right) + \log \det \Sigma_\theta(z_t, a_t).$$

When using the learned dynamics model for planning, we use the TS-1 method from Chua et al. [44].
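A per-transition sketch of this objective (diagonal covariance parameterized by a log-variance, constant terms dropped since they do not affect gradients; each ensemble member is trained independently on its own bootstrap of the data):

```python
import numpy as np

def dynamics_nll(mu, log_var, z_next):
    """Negative log-likelihood of z_next under the predicted diagonal
    Gaussian N(mu, diag(exp(log_var))), up to an additive constant."""
    return 0.5 * np.sum((mu - z_next) ** 2 * np.exp(-log_var) + log_var)
```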

Value Functions:

As discussed in Section 4.3, we train an ensemble of recursively defined value functions to predict long-term reward. We represent these functions using fully connected neural networks with 3 hidden layers of 256 hidden units each. Similarly to [18], we use separate training objectives during offline and online training. During offline training, we train the value function to predict the actual discounted reward-to-go on all trajectories in the offline dataset. Hence, for a latent vector $z_t$, the offline loss for a value function $V_\phi$ with parameters $\phi$ is

$$\mathcal{L}_{\text{offline}}(\phi) = \left( V_\phi(z_t) - \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \right)^2.$$

Then, in online training, we also store a target network and use it to calculate a temporal difference (TD-1) error,

$$\mathcal{L}_{\text{TD}}(\phi) = \left( V_\phi(z_t) - \left( r_t + \gamma V_{\phi'}(z_{t+1}) \right) \right)^2,$$

where $\phi'$ are the parameters of a lagged target network, corresponding to the policy at the timestep at which the target network was last set. We update the target network every 100 updates. In each of these equations, $\gamma$ is a discount factor (we use $\gamma = 0.99$). Because all episodes end by hitting a time horizon, we found it beneficial to remove the terminal-mask multiplier usually used with TD-1 error losses.
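The online objective can be sketched as a squared TD-1 error against the frozen target network (function names are illustrative; in training the target parameters would be hard-copied from the online parameters every 100 gradient updates):

```python
def td1_loss(v, v_target, z, z_next, r, gamma=0.99):
    """Squared TD-1 error against a lagged target network.

    v, v_target: callables mapping a latent state to a scalar value estimate.
    No terminal mask is applied, since every episode here ends only by
    reaching the time horizon.
    """
    target = r + gamma * v_target(z_next)  # bootstrap from the frozen target net
    return (v(z) - target) ** 2
```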

For all simulated experiments, we update value functions using only data collected by the suboptimal demonstrator or collected online, ignoring offline data collected via random interactions or offline demonstrations of constraint-violating behavior.

Constraint and Goal Estimators:

We represent the constraint indicator with a neural network with 3 hidden layers of 256 hidden units each, trained with a binary cross-entropy loss using constraint-violating transitions as unsafe examples and constraint-satisfying states as safe examples. Similarly, we represent the goal estimator with a neural network with 3 hidden layers of 256 hidden units each. This estimator is also trained with a binary cross-entropy loss, with positive examples drawn from goal states and negative examples sampled from all datasets. For both the constraint estimator and the goal indicator, training data is sampled uniformly from a replay buffer containing all offline and online data.

Safe Set:

The Safe Set classifier $f_S$ is represented with a neural network with 3 hidden layers of 256 hidden units each. We train the safe set classifier to predict

$$y_t = \max\!\left( \mathbb{1}_{\mathcal{S}}(s_t),\; \gamma_S \, f_S(z_{t+1}) \right)$$

using a binary cross-entropy loss, where $\mathbb{1}_{\mathcal{S}}(s_t)$ is an indicator function for whether $s_t$ is part of a successful trajectory and $\gamma_S$ is the Safe Set Bellman coefficient reported in Table 3. Training data is sampled uniformly from a replay buffer containing all of the collected data.

Cross Entropy Method:

We use the cross-entropy method to solve the optimization problem in Equation 2. We build on the implementation of the cross-entropy method provided in [52], which works by sampling a population of action sequences from a diagonal Gaussian distribution, simulating each one several times over the learned dynamics, and refitting the parameters of the Gaussian on the trajectories with the highest score under Equation 2, where constraints are implemented by assigning large negative rewards to trajectories which violate either the Safe Set constraint or user-specified constraints. This process is repeated for a fixed number of iterations to iteratively refine the set of sampled trajectories. To improve the optimizer's efficiency on tasks where subsequent actions are often correlated, we sample a proportion of the optimizer's candidates at the first iteration from the distribution it learned when planning the last action. To avoid local minima, we sample a proportion uniformly from the action space. See Chua et al. [44] for more details on the cross-entropy method as applied to planning over neural network dynamics models.

As mentioned in Section 4.4, we set the Safe Set threshold adaptively by checking whether at least one sampled plan satisfies the Safe Set constraint at each CEM iteration. If no such plan exists, we multiply the threshold by a constant factor less than 1 and re-initialize the optimizer at the first CEM iteration with the new threshold. This ensures that the threshold is set such that planning back to the Safe Set is possible. The initial threshold value is reported in Table 3.
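The planner with the adaptive threshold can be sketched as follows. This is a simplified illustration, not the paper's implementation: all names, the population sizes, and the shrink factor are ours, and the warm-starting and uniform-sampling proportions described above are omitted for brevity:

```python
import numpy as np

def cem_plan(score, safe_frac, horizon=5, act_dim=2, pop=200, elites=20,
             iters=5, delta=0.8, shrink=0.5, rng=None):
    """CEM over action sequences with an adaptive safe-set threshold.

    score(seq): reward of an H-step sequence under the learned model (large
      negative if it violates user-specified constraints).
    safe_frac(seq): fraction of dynamics particles whose terminal state the
      safe-set classifier accepts; a plan is feasible if this exceeds delta.
    """
    rng = rng or np.random.default_rng()
    mu = np.zeros((horizon, act_dim))
    sigma = np.ones((horizon, act_dim))
    i = 0
    while i < iters:
        seqs = mu + sigma * rng.standard_normal((pop, horizon, act_dim))
        feasible = np.array([safe_frac(s) >= delta for s in seqs])
        if not feasible.any():
            # No sampled plan reaches the safe set: relax the threshold and
            # restart the optimizer from the first CEM iteration.
            delta *= shrink
            mu = np.zeros((horizon, act_dim))
            sigma = np.ones((horizon, act_dim))
            i = 0
            continue
        scores = np.where(feasible, np.array([score(s) for s in seqs]), -np.inf)
        elite = seqs[np.argsort(scores)[-elites:]]       # top-scoring feasible plans
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        i += 1
    return mu[0]  # execute the first action of the refined mean sequence
```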

7.1.2 Soft Actor-Critic from Demonstrations (SACfD)

We utilize the implementation of the Soft Actor-Critic algorithm from [53] and initialize the actor and critic from demonstrations, keeping all other hyperparameters at the defaults in the provided implementation. We create a new dataset using only data from the suboptimal demonstrator, and use this data to behavior-clone the actor and initialize the critic with offline Bellman backups. We use the same mean-squared behavior-cloning loss as for the behavior-cloned policy, but only train the mean of the SAC policy: for a policy with mean $\mu_\psi$ and parameters $\psi$, we use the loss $\mathcal{L}(\psi) = \sum_{i}\sum_{t} \| \mu_\psi(s^i_t) - a^i_t \|_2^2$, where $s^i_t$ and $a^i_t$ are the state and action at timestep $t$ of demonstrator trajectory $i$. We also experimented with training the SAC critic on all data provided to LS3 but found that this hurt performance. We use the architecture from [53] and update neural network weights using an Adam optimizer. The only hyperparameter for SACfD that we tuned across environments was the reward penalty imposed upon constraint violations. For all simulation experiments, we evaluated several penalty values and report the highest performing one; the Reacher task used a different value from the other tasks. We observed that higher penalty values resulted in worse task performance without a significant increase in constraint satisfaction. We hypothesize that because the agent is frozen in the environment upon constraint violations, the resulting loss of reward is sufficient to enable SACfD to avoid constraint violations.

7.1.3 Soft Actor-Critic from Demonstrations with Learned Recovery Zones (SACfD+RRL)

We build on the implementation of the Recovery RL algorithm [14] provided in [54]. We train the safety critic on all offline data. Recovery RL uses SACfD as its task-policy optimization algorithm and introduces two new hyperparameters, $\gamma_{\text{risk}}$ and $\epsilon_{\text{risk}}$. For each of the simulation environments, we evaluated SACfD+RRL across 3-4 settings of these hyperparameters and report results from the highest performing run; we tune them separately for the navigation, reacher, sequential pushing, and cable routing environments.

7.1.4 Advantage Weighted Actor-Critic (AWAC)

To provide a comparison to state-of-the-art offline reinforcement learning algorithms, we evaluate AWAC [49] on the experimental domains in this work. We use the implementation of AWAC from [55]. For all simulation experiments, we evaluated several hyperparameter settings and report the highest performing one, using the same setting for all experiments. We used the default settings from [55] for all other hyperparameters.

7.2 LS3 Implementation Details

Parameter Navigation Reacher Sequential Pushing Cable Routing
Safe set threshold 0.8 0.5 0.8 0.8
Constraint threshold 0.2 0.2 0.2 0.2
Planning horizon 5 3 3 5
Dynamics particles 20 20 20 20
CEM samples per iteration 1000 1000 1000 2000
CEM elites 100 100 100 200
CEM iterations 5 5 5 5
Latent dimension 32 32 32 32
VAE KL weight 1.0 1.0 1.0 0.3
Frame Stacking No Yes No No
Batch Size 256 256 256 256
Discount factor 0.99 0.99 0.99 0.99
Safe set Bellman coefficient 0.3 0.3 0.9 0.9
Table 3: Hyperparameters for LS3

In Table 3, we present the hyperparameters used to train and run LS3. We present the Safe Set and constraint thresholds, the planning horizon, and the VAE KL regularization weight β. We also present the number of particles sampled over the probabilistic latent dynamics model for a fixed action sequence, which is used to estimate the probability of constraint satisfaction and the expected rewards. For the cross-entropy method, we sample a population of action sequences at each iteration, take the best-scoring sequences, and refit the sampling distribution; this process repeats for a fixed number of iterations. We also report the latent space dimension, whether frame stacking is used as input, the training batch size, and the discount factor γ. Finally, we present values of the Safe Set Bellman coefficient. For all domains, we scale RGB observations to a fixed size. For all modules we use the Adam optimizer, with the dynamics model trained at a different learning rate than the other modules.

7.3 Experimental Domain Details

7.3.1 Navigation

The visual navigation domain has 2-D single-integrator dynamics with additive zero-mean Gaussian noise. The agent starts at a fixed position, and the goal set is a Euclidean ball of fixed radius. The demonstrations are created by guiding the agent north for 20 timesteps, east for 40 timesteps, and then directly towards the goal until the episode terminates. This tuned controller ensures that demonstrations avoid the obstacle and reach the goal set, but they are very suboptimal. To collect demonstrations of constraint-violating behavior, we randomly sample starting points throughout the environment, move in a random direction for 15 timesteps, and then move directly towards the obstacle. We do not collect additional random-interaction data in this environment. We collect 50 demonstrations of successful behavior and 50 trajectories containing constraint-violating behavior.
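The scripted demonstrator described above can be sketched as follows; the step size, noise scale, goal radius, and coordinate conventions here are illustrative values, not the paper's exact parameters:

```python
import numpy as np

def demo_trajectory(start, goal, noise_std=0.05, max_steps=100, rng=None):
    """Suboptimal demonstrator for the pointmass navigation task: head north
    for 20 steps, east for 40 steps, then straight toward the goal.
    Single-integrator dynamics: x <- x + u + Gaussian noise."""
    rng = rng or np.random.default_rng()
    x = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    traj = []
    for t in range(max_steps):
        if t < 20:
            u = np.array([0.0, 1.0])                    # north
        elif t < 60:
            u = np.array([1.0, 0.0])                    # east
        else:
            d = goal - x
            u = d / (np.linalg.norm(d) + 1e-8)          # unit step toward goal
        x = x + u + noise_std * rng.standard_normal(2)
        traj.append(x.copy())
        if np.linalg.norm(x - goal) < 0.5:              # inside the goal ball
            break
    return np.array(traj)
```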

7.3.2 Reacher

The reacher domain is built on the reacher domain provided in the DeepMind Control Suite [50]. The robot is a planar 2-link arm, and the agent supplies torques to each of the 2 joints. Because velocity is not observable from a single frame, algorithms are provided with several stacked frames as input. The start position of the end effector is fixed, and the objective is to navigate the end effector to a fixed goal set in the top left of the workspace without allowing the end effector to enter a large red stay-out zone. To collect constraint-violating data, we randomly sample starting states in the environment and then use a PID controller to move towards the constraint. To sample random data that requires the agent to model velocity for accurate prediction, we start trajectories at random places in the environment and then sample each action from a normal distribution centered around the previous action. We collect 50 demonstrations of successful behavior, 50 trajectories containing constraint violations, and 100 short trajectories of random data.

7.3.3 Sequential Pushing

The sequential pushing environment is implemented in MuJoCo [56], and the robot specifies a desired planar displacement for its end effector. The goal is to push all 3 blocks backwards on the table by at least some displacement, but constraints are violated if blocks are pushed backwards off of the table. Demonstrations are created by guiding the end effector to the center of each block and then moving the end effector in a straight line at low velocity until the block is in the goal set; this process is repeated for each of the 3 blocks. Data containing constraint violations and random transitions is collected by randomly switching between a policy that moves towards the blocks and a policy that samples uniformly from the action space. We collect 500 demonstrations of successful behavior and 300 trajectories of random and/or constraint-violating behavior.

7.3.4 Physical Cable Routing

This task starts with the robot grasping one endpoint of the red cable, and it can make motions with its end effector. The goal is to guide the red cable to intersect with the green goal set while avoiding the blue obstacle. The ground-truth goal and obstacle checks are performed with color masking. LS3 and all baselines are provided with a segmentation mask of the cable as input. The demonstrator generates trajectories by moving the end effector well over the obstacle and to the right before executing a straight-line trajectory to the goal set. This ensures that the demonstrator avoids the obstacle with significant margin, but its trajectories may not be optimal for the task. Random trajectories are collected by following a demonstrator trajectory for some random amount of time and then sampling from the action space until the episode hits the time horizon. We collect 420 demonstrations of successful behavior and 150 random trajectories.

7.4 Sensitivity Experiments

Key hyperparameters in LS3 are the constraint threshold and Safe Set threshold, which control whether the agent deems predicted states constraint-violating or inside the Safe Set, respectively. We ablate these parameters for the Sequential Pushing environment in Figures 6 and 7. We find that lower values of the constraint threshold make the agent less likely to violate constraints, as expected. Additionally, we find that higher values of the Safe Set threshold help constrain exploration more effectively, but too high a threshold leads to poor performance as the agent exploits local maxima in the Safe Set estimation. Finally, we ablate the planning horizon for LS3 (Figure 8) and find that when the horizon is too high, LS3 can explore too aggressively away from the safe set, leading to poor performance. When the horizon is lower, LS3 explores much more stably, but if it is too low, LS3 is eventually unable to explore significantly new plans, while slightly increasing the horizon allows for continuous improvement in performance.

Figure 6: Hyperparameter Sweep for LS3 Constraint Threshold: Plots show mean and standard error over 10 random seeds for experiments with different settings of the constraint threshold on the sequential pushing environment. As expected, we see that without avoiding latent space obstacles (No Constraints) the agent violates constraints more often, while lower thresholds (meaning the planning algorithm is more conservative) generally lead to fewer violations.
Figure 7: Hyperparameter Sweep for LS3 Safe Set Threshold: Plots show mean and standard error over 10 random seeds for experiments with different settings of the Safe Set threshold on the sequential pushing environment. We see that after offline training, the agent can successfully complete the task only when the threshold is high enough to sufficiently guide exploration, and that runs with higher threshold values are more successful overall.
Figure 8: Hyperparameter Sweep for LS3 Planning Horizon: Plots show mean and standard error over 10 random seeds for experiments with different settings of the planning horizon on the sequential pushing environment. We see that when the planning horizon is too high, the agent cannot reliably complete the task due to modeling errors. When the planning horizon is too low, it learns quickly but cannot significantly improve because it is constrained to the safe set. We found an intermediate horizon to balance this trade-off best.