Author implementation of LS3: Latent Space Safe Sets for Long-Horizon Visuomotor Control of Iterative Tasks
Reinforcement learning (RL) algorithms have shown impressive success in exploring high-dimensional environments to learn complex, long-horizon tasks, but can often exhibit unsafe behaviors and require extensive environment interaction when exploration is unconstrained. A promising strategy for safe learning in dynamically uncertain environments is requiring that the agent can robustly return to states where task success (and therefore safety) can be guaranteed. While this approach has been successful in low-dimensional settings, enforcing this constraint in environments with high-dimensional state spaces, such as images, is challenging. We present Latent Space Safe Sets (LS3), which extends this strategy to iterative, long-horizon tasks with image observations by using suboptimal demonstrations and a learned dynamics model to restrict exploration to the neighborhood of a learned Safe Set where task completion is likely. We evaluate LS3 on 4 domains, including a challenging sequential pushing task in simulation and a physical cable routing task. We find that LS3 can use prior task successes to restrict exploration and learn more efficiently than prior algorithms while satisfying constraints. See https://tinyurl.com/latent-ss for code and supplementary material.
Visual planning over learned forward dynamics models is a popular area of research in robotic control from images [1, 2, 3, 4, 5, 6, 7], as it enables closed-loop, model-based control for tasks where the state of the system is not directly observable or difficult to analytically model, such as the configuration of a sheet of fabric or segment of cable. These methods learn predictive models over either images or a learned latent space, which are then used by model predictive control (MPC) to optimize image-based task costs. While these approaches have significant promise, there are several open challenges in learning policies from visual observations. First, reward specification is particularly challenging for visuomotor control tasks, because high-dimensional observations often do not expose the necessary features required to design informative reward functions, especially for long-horizon tasks. Second, while many prior reinforcement learning methods have been successfully applied to image-based control tasks [9, 10, 11, 12, 13], learning policies from image observations often requires extensive exploration due to the high dimensionality of the observation space and the difficulties in reward specification, making safe and efficient learning exceedingly challenging.
Safe reinforcement learning is of critical importance in robotics, as unconstrained exploration can cause serious damage to the robot and its surroundings. Safe learning also tends to be efficient learning, since it prevents the agent from exploring clearly suboptimal behaviors. There has been significant prior work on safe policy learning in low-dimensional observation spaces [15, 16, 17, 18, 19]. Thananjeyan et al.
present a safe and efficient algorithm for policy learning for tasks with a low-variance start state distribution and fixed goal state by learning a Safe Set, which captures the set of states from which the agent has previously completed the task. Safe Sets are a common ingredient in classical control algorithms for guaranteeing policy improvement and constraint satisfaction [19, 17, 20], and restricting exploration to the neighborhood of this set can lead to highly efficient and safe learning for long-horizon tasks. Extending these methods to variable start and goal sets in low-dimensional settings has been studied by Thananjeyan et al. However, scaling these approaches to high-dimensional image observations is challenging, since images do not directly expose details about the system state or dynamics that are typically needed for formal controller analysis [17, 19, 20]. In this work, we study learning in the iterative setting, where the start and goal sets have low variance, and focus on scaling these approaches to image-based inputs. The proposed algorithm makes several practical relaxations and maintains the same safety guarantees, under the same additional assumptions, as in Thananjeyan et al. [18, 17].
We introduce Latent Space Safe Sets (LS3), a model-based RL algorithm for visuomotor policy learning that provides safety by learning a continuous relaxation of a Safe Set in a learned latent space. This Latent Space Safe Set ensures that the agent can plan back to regions in which it is confident in task completion, even when learning in high-dimensional spaces. This constraint makes it possible to (1) improve safely, by ensuring that the agent can consistently complete the task (and therefore avoid unsafe behavior), and (2) learn efficiently, since the agent only explores promising states in the immediate neighborhood of those in which it was previously successful. LS3 additionally enforces user-specified state space constraints by estimating the probability of constraint violations over a learned, probabilistic, latent space dynamics model. We contribute (1) Latent Space Safe Sets (LS3), a novel reinforcement learning algorithm for safely and efficiently learning long-horizon visual planning tasks, (2) simulation experiments on 3 continuous control visuomotor tasks suggesting that LS3 can learn to improve upon demonstrations more safely and efficiently than baselines, and (3) physical experiments on a vision-based cable routing task on the da Vinci surgical robot suggesting that LS3 can learn a policy more efficiently than prior algorithms while consistently completing the task and satisfying constraints during learning.
In iterative learning control (ILC), the agent tracks an initially provided reference trajectory and uses data from controller rollouts to iteratively refine tracking performance. Rosolia et al. and Rosolia and Borrelli [23, 19] present a class of algorithms, known as Learning Model Predictive Control (LMPC), which are reference-free and instead iteratively improve upon the performance of an initial feasible trajectory. To achieve this, they present model predictive control algorithms that use data from controller rollouts to learn a Safe Set and value function, with which recursive feasibility, stability, and local optimality can be guaranteed for a known, deterministic nonlinear system or a stochastic linear system under certain regularity assumptions. However, a core limitation of these algorithms is that they assume system dynamics are known, and they cannot easily be applied to high-dimensional control problems. Thananjeyan et al. extend the LMPC framework to higher-dimensional control settings in which system dynamics are unknown and must be estimated iteratively from experience, but the visuomotor control setting introduces a number of new challenges for iterative learning control algorithms, such as learning system dynamics, Safe Sets, and value functions that can flexibly and efficiently accommodate visual inputs.
There has been significant recent progress in algorithms which combine ideas from model-based planning and control with deep learning [24, 25, 26, 27, 28, 29]. These algorithms are gaining popularity in the robotics community, as they enable learning complex policies from data while maintaining some of the sample efficiency and safety benefits of classical model-based control techniques. However, these algorithms typically require hand-engineered dense cost functions for task specification, which can be difficult to provide, especially in high-dimensional spaces. This motivates leveraging (possibly suboptimal) demonstrations to provide an initial signal about desirable agent behavior. There is prior work on leveraging demonstrations in model-based algorithms, such as Quinlan and Khatib and Ichnowski et al., which use model-based control with known dynamics to refine initially suboptimal motion plans, and Fu et al., which uses demonstrations to seed a learned dynamics model for fast online adaptation using iLQR. Thananjeyan et al. and Zhu et al. present ILC algorithms which rapidly improve upon suboptimal demonstrations when system dynamics are unknown. However, these algorithms either require knowledge of system dynamics [30, 31] or are limited to low-dimensional state spaces [26, 18, 32], and cannot be flexibly applied to visuomotor control tasks.
Reinforcement learning and model-based planning from visual observations are gaining significant interest, as RGB images provide an easily available observation space for robot learning [1, 33]. Recent work has proposed a number of model-free and model-based algorithms that have seen success in laboratory settings on a number of robotic tasks when learning from visual observations [34, 35, 10, 36, 12, 13, 1, 37, 33]. However, two core issues that prevent application of many RL algorithms in practice, inefficient exploration and safety, are significantly exacerbated when learning from high-dimensional visual observations, in which the space of possible behaviors is very large and the features required to determine whether the robot is safe are not readily exposed. There has been significant prior work on addressing inefficiencies in exploration for visuomotor control, such as latent space planning [2, 33, 37] and goal-conditioned reinforcement learning [13, 10]. However, safe reinforcement learning for visuomotor tasks has received substantially less attention. Thananjeyan et al. and Kahn et al. present reinforcement learning algorithms which estimate the likelihood of constraint violations in order to avoid them or to reduce the robot's velocity. Unlike these algorithms, which focus on methods to avoid violating user-specified constraints, LS3 additionally provides consistent task completion during learning by limiting exploration to the neighborhood of prior task successes. This difference makes LS3 less susceptible to the challenges of unconstrained exploration present in standard model-free reinforcement learning algorithms.
We consider an agent interacting in a finite-horizon, goal-conditioned Markov Decision Process (MDP), which can be described with the tuple M = (S, A, P, R, μ, T). S and A are the state and action spaces, P: S × A → P(S) maps a state and action to a probability distribution over subsequent states, R is the reward function, μ is the initial state distribution (s_0 ~ μ), and T is the time horizon. In this work, the agent is only provided with RGB image observations s ∈ R^{W×H×3}, where W and H are the image width and height in pixels, respectively. We consider tasks in the iterative learning control setting, where the agent must reach a goal set G ⊆ S as efficiently as possible and the support of μ is small. While there are a number of possible choices of reward functions that would encourage fast convergence to G, providing shaped reward functions can be exceedingly challenging, especially when learning complex tasks in which agents are only provided with high-dimensional observations. Thus, as in Thananjeyan et al., we consider a sparse reward function that only provides a signal upon task completion: R(s) = 0 if s ∈ G and R(s) = −1 otherwise. To incorporate constraints, we augment the MDP with an extra constraint indicator function C: S → {0, 1} which indicates whether a state satisfies user-specified state-space constraints, such as avoiding known obstacles. This is consistent with the modified CMDP formulation used in prior work. We assume that R and C can be evaluated on the current state of the system, but cannot be used for planning. We make this assumption because in practice we plan over predicted future states, which may not be predicted at sufficiently high fidelity to expose the necessary information to directly evaluate R and C during planning.
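The sparse reward and constraint indicator above can be sketched concretely. This is an illustrative toy instance on a 2D pointmass-style state (the goal and obstacle geometry here are assumptions, not the paper's environments):

```python
import numpy as np

GOAL_CENTER = np.array([1.0, 1.0])   # hypothetical goal region center
GOAL_RADIUS = 0.1                    # hypothetical goal region radius
OBSTACLE_LO = np.array([0.4, 0.4])   # hypothetical obstacle bounding box
OBSTACLE_HI = np.array([0.6, 0.6])

def reward(state):
    """Sparse reward R(s): 0 inside the goal set G, -1 otherwise."""
    in_goal = np.linalg.norm(state - GOAL_CENTER) <= GOAL_RADIUS
    return 0.0 if in_goal else -1.0

def constraint(state):
    """Constraint indicator C(s): 1 if the state violates a user-specified
    constraint (here, lying inside the obstacle box), 0 otherwise."""
    violating = np.all(state >= OBSTACLE_LO) and np.all(state <= OBSTACLE_HI)
    return 1.0 if violating else 0.0
```

Note that both functions take the true environment state; during planning the agent only sees predicted latent states, which is why learned classifiers stand in for R and C there.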
Given a policy π, its expected total return in M can be defined as R^π = E_π[Σ_{t=0}^{T−1} R(s_t)]. Furthermore, we define Λ^π_H(s) as the probability of future constraint violation (within time horizon H) under policy π from state s. The objective is then to maximize the expected return while maintaining a constraint violation probability lower than some δ ∈ [0, 1]. This can be written formally as follows:
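One way to write this constrained objective, consistent with the definitions above (π the policy, R the sparse reward, C the constraint indicator, H the constraint horizon, δ the violation budget; the paper's exact display may differ in notation):

```latex
\pi^{*} = \operatorname*{arg\,max}_{\pi}\;
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} R(s_t)\right]
\quad \text{s.t.} \quad
\Lambda^{\pi}_{H}(s_t)
  = \mathbb{P}_{\pi}\!\left(\exists\, t' \in \{t,\ldots,t+H\} :
      C(s_{t'}) = 1 \,\middle|\, s_t\right) \le \delta
  \quad \forall t .
```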
We assume that the agent is provided with an offline dataset of transitions in the environment, of which some subset are constraint violating and some subset come from successful (but possibly suboptimal) demonstrations. As in prior work, the offline dataset contains examples of constraint-violating behaviors (for example, from prior runs of different policies or collected under human supervision) so that the agent can learn about states which violate user-specified constraints.
Here we describe how LS3 uses demonstrations and online environment interaction to safely learn iteratively improving policies. Section 4.1 describes how we learn a low-dimensional latent representation of image observations to facilitate efficient model-based planning. To enable this planning, we learn a probabilistic forward dynamics model in the learned latent space, along with models to estimate whether plans will likely complete the task (Section 4.2) and to estimate future rewards and constraint violations (Section 4.3) from predicted trajectories. Finally, in Section 4.4, we discuss how all of these components are combined in LS3 to enable safe and efficient policy improvement. The dataset is expanded using online rollouts of LS3 and used to update all latent space models (Sections 4.2 and 4.3) after a fixed number of rollouts. See Algorithm 1 and the supplement for further details on training procedures and data collection for all components.
Learning compressed representations of images has been a popular approach in vision-based control, facilitating efficient planning and control algorithms that can reason over lower-dimensional inputs [2, 37, 6, 39, 40, 33]. To learn such a representation, we train a β-variational autoencoder (β-VAE) on states in the offline dataset to map states to a probability distribution over a d-dimensional latent space. The resulting encoder network is then used to sample latent vectors to train a forward dynamics model, value function, reward estimator, constraint classifier, and Safe Set, and these elements are combined to define a policy for model-based planning. Motivated by Laskin et al., during training we augment inputs to the encoder with random cropping, which we found helpful in learning representations that are useful for planning. We use the same latent dimension for all environments and found that varying it did not significantly affect performance.
LS3 learns a binary classifier for latent states to learn a latent space Safe Set that represents states from which the agent has high confidence in task completion based on prior experience. Because the agent can reach the goal from these states, they are safe: the agent can avoid constraint violations by simply completing the task as before. While classical algorithms use known dynamics to construct Safe Sets, we approximate this set using successful trajectories from prior iterations. At each iteration, the algorithm collects trajectories in the environment. Let S^j_k denote the set of states visited in trajectory j of iteration k, and let a trajectory be marked successful if it reaches G. We define the sampled Safe Set at iteration k as the union of the states along all successful trajectories collected through iteration k. In short, this is the set of states from which the agent has successfully navigated to G by iteration k of training.
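The sampled Safe Set construction above reduces to a simple union over successful rollouts. A minimal sketch (the data layout and function name are illustrative, not the authors' code):

```python
def sampled_safe_set(trajectories):
    """Collect all states appearing in trajectories that reached the goal.

    `trajectories` is a list of (states, reached_goal) pairs, where `states`
    is the list of states visited by one rollout and `reached_goal` is a
    bool. The result mirrors the paper's discrete sampled Safe Set: the
    union of states along successful rollouts.
    """
    safe = []
    for states, reached_goal in trajectories:
        if reached_goal:
            safe.extend(states)
    return safe
```

In practice this set grows across iterations as new successful trajectories are collected, which is what allows exploration to expand over time.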
This discrete set is difficult to plan to with continuous-valued state distributions, so we leverage data inside the discrete Safe Set, data outside it, and the learned encoder from Section 4.1 to learn a continuous relaxation of this set in latent space (the Latent Safe Set). We train a neural network with a binary cross-entropy loss to learn a binary classifier that predicts the probability of a state with a given latent encoding being in the Safe Set. To mitigate the negative bias that arises when trajectories that start in safe regions fail, we use the intuition that if a state s_{t+1} is safe, then it is likely that s_t is also safe. To do this, rather than just predicting the raw membership indicator, we train the classifier to predict a discounted relaxation of it that also assigns partial safety to predecessors of safe states. The relaxed Latent Safe Set is parameterized by the superlevel sets of the classifier, {z : f_SS(z) ≥ δ_SS}, where the level δ_SS is adaptively set during execution.
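One simple reading of this discounted relabeling, sketched below, gives a state k steps before a Safe Set state the soft target γ_safe^k (an illustrative relaxation; the paper's exact target may differ):

```python
def safe_set_targets(traj_in_safe_set, gamma_safe=0.9):
    """Soft classifier labels along one trajectory.

    `traj_in_safe_set[t]` is 1 if state t is in the sampled Safe Set, else 0.
    Sweeping backwards, a state k steps before a safe state receives target
    gamma_safe**k, encoding the intuition that predecessors of safe states
    are likely safe too.
    """
    T = len(traj_in_safe_set)
    targets = [0.0] * T
    future = 0.0
    for t in reversed(range(T)):
        future = max(float(traj_in_safe_set[t]), gamma_safe * future)
        targets[t] = future
    return targets
```

The classifier is then trained with binary cross-entropy against these soft targets instead of hard 0/1 labels.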
In this work, we define rewards based on whether the agent has reached a state s ∈ G, but we need rewards that are defined on predictions from the dynamics, which may not correspond to valid real images. To address this, we train a classifier to map the encoding of a state to the probability that the state is contained in G, using terminal states of successful trajectories (which are known to be in G) as positive examples and other states as negatives. However, in the temporally-extended, sparse-reward tasks we consider, reward prediction alone is insufficient, because rewards only indicate whether the agent is in the goal set and thus provide no signal on task progress unless the agent can plan all the way to the goal set. To address this, as in prior MPC literature [18, 17, 19, 8], we train a recursively-defined value function (details in the supplement). Similarly to the reward function, we use the encoder (Section 4.1) to train a classifier that maps the encoding of a state to the probability of constraint violation, using constraint-violating states as positive examples and constraint-satisfying states as negatives.
LS3 aims to maximize the total reward attained in the environment while limiting the constraint violation probability to some threshold δ (equation 1). We optimize an approximation of this objective over an H-step receding horizon with model-predictive control. Precisely, LS3 solves the following optimization problem to generate an action to execute at timestep t:
In this problem, the expectations and probabilities are taken with respect to the learned, probabilistic dynamics model. The optimization problem is solved approximately using the cross-entropy method (CEM), which is a popular optimizer in model-based RL [44, 18, 17, 45, 14].
The objective function is the expected sum of future rewards if the agent executes the candidate action sequence and then subsequently acts according to the learned value function (equation 2). First, the current state s_t is encoded to a latent state z_t (equation 3). Then, for a candidate sequence of actions, an H-step latent trajectory is sampled from the learned dynamics (equation 4). LS3 constrains exploration using two chance constraints: (1) the terminal latent state in the plan must fall in the Latent Safe Set (equation 5), and (2) all latent states in the trajectory must satisfy user-specified state-space constraints (equation 6), where the constrained region is the set of all latent states whose corresponding observations are constraint violating. The optimizer estimates constraint satisfaction probabilities for a candidate action sequence by simulating it repeatedly over the learned dynamics. The first chance constraint ensures that the agent maintains the ability to return, within H steps if necessary, to safe states where it knows how to complete the task. Because the agent replans at each timestep, it need not actually return to the Safe Set: during training, the Safe Set expands, enabling further exploration. In practice, we set the Safe Set classifier level δ_SS adaptively, as described in the supplement. The second chance constraint encourages a constraint violation probability of no more than δ. After solving the optimization problem, the agent executes the first action in the plan, observes a new state, and replans.
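Scoring one candidate plan under these two chance constraints can be sketched as follows. All function names (`dynamics_sample`, `safe_set_prob`, etc.) and thresholds are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def evaluate_plan(z0, actions, dynamics_sample, reward_fn, value_fn,
                  safe_set_prob, constraint_prob, n_particles=20,
                  delta_ss=0.8, delta_c=0.2):
    """Score one candidate action sequence under sampled latent dynamics.

    dynamics_sample(z, a) -> one stochastic sample of the next latent state.
    Returns (score, feasible): the plan is feasible when the terminal latent
    state lands in the Latent Safe Set and user-specified constraints hold
    with high enough estimated probability across particles.
    """
    returns, terminal_safe, any_violation = [], 0, 0
    for _ in range(n_particles):
        z, total, violated = z0, 0.0, False
        for a in actions:
            z = dynamics_sample(z, a)
            total += reward_fn(z)
            if constraint_prob(z) > 0.5:
                violated = True
        total += value_fn(z)                      # terminal value estimate
        returns.append(total)
        terminal_safe += safe_set_prob(z) > 0.5   # chance constraint (1)
        any_violation += violated                 # chance constraint (2)
    feasible = (terminal_safe / n_particles >= delta_ss and
                any_violation / n_particles <= delta_c)
    return float(np.mean(returns)), feasible
```

The CEM optimizer would call this for each sampled action sequence and keep only the highest-scoring feasible candidates.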
We evaluate LS3 on 3 robotic control tasks in simulation and a physical cable routing task on the da Vinci Research Kit (dVRK). Safe RL is of particular interest for surgical robots such as the dVRK due to their delicate structure, motivating safety, and relatively imprecise controls [18, 47], motivating closed-loop control. We study whether LS3 can learn more safely and efficiently than algorithms that do not structure exploration based on prior task successes.
We evaluate LS3 in comparison to prior algorithms that behavior clone suboptimal demonstrations before exploring online (SACfD) or that leverage offline reinforcement learning to learn a policy from all offline data before updating the policy online (AWAC). For both of these comparisons we enforce constraints via a tuned reward penalty for constraint violations, as in prior work. We also implement a version of SACfD with a learned recovery policy (SACfD+RRL) using the Recovery RL algorithm to use prior constraint-violating data to try to avoid constraint-violating states. Finally, we compare LS3 to an ablated version without the Safe Set constraint in equation 5 (LS3 (no Safe Set)) to evaluate whether the Safe Set promotes consistent task completion and stable learning. See the supplement for details on hyperparameters and offline data used for LS3 and prior algorithms.
Learning curves showing mean and standard error over 10 random seeds. We see that LS3 consistently learns more quickly than baselines, as well as the ablated algorithm without the Safe Set. Although SACfD and SACfD+RRL eventually achieve similar reward values, LS3 is much more sample efficient and stable across random seeds.
For each algorithm on each domain, we aggregate statistics over random seeds (10 for simulation experiments, 3 for the physical experiment), reporting the mean and standard error across the seeds. We present learning curves showing the total reward for each training trajectory to study how efficiently LS3 and the comparisons learn each task. Because all tasks use the sparse, task-completion-based rewards defined in Section 3, the total reward for a trajectory reflects the time to reach the goal set, where more negative rewards correspond to slower convergence to G. Thus, for a task with task horizon T, a total reward greater than −T implies successful task completion. The state is frozen in place upon constraint violation until the task horizon elapses. We report task success and constraint satisfaction rates for LS3 and comparisons to study whether algorithms consistently complete the task during learning and satisfy user-specified state-space constraints. LS3 collects a fixed number of trajectories in between training phases (a different number on simulated tasks than on the physical task), while the SACfD and AWAC comparisons update their parameters after each timestep. This presents a calibrated metric in terms of the amount of data collected across algorithms.
In simulation, we evaluate LS3 on 3 vision-based continuous control domains illustrated in Figure 3. We evaluate LS3 and comparisons on a constrained visual navigation task (Pointmass Navigation), where the agent navigates from a fixed start state to a fixed goal set while avoiding a large central obstacle. We study this domain to gain intuition and to visualize the learned value function, goal/constraint indicators, and Safe Set in Figure 2. We then study a constrained image-based reaching task (Reacher), where the objective is to navigate the end effector of a 2-link planar robotic arm to a yellow goal position without the end effector entering a red stay-out zone. We then study a challenging sequential image-based robotic pushing domain (Sequential Pushing), in which the objective is to push each of 3 blocks forward on the table without pushing them to either side and causing them to fall out of the workspace. Finally, we evaluate LS3 with an image-based physical experiment on the da Vinci Research Kit (dVRK) (Figure 3), where the objective is to guide the endpoint of a cable to a goal region without letting the cable or end effector collide with an obstacle. Task horizons vary across the domains; see the supplement for the exact horizons and more details on all domains.
We find that LS3 is able to learn more stably and efficiently than all comparisons across all simulated domains while converging to similar performance within 250 trajectories collected online (Figure 4). LS3 is able to consistently complete the task during learning, while the comparisons, which do not learn a Safe Set to structure exploration based on prior successes, exhibit much less stable learning. Additionally, in Tables 1 and 2, we report the task success rate and constraint violation rate of all algorithms during training. We find that LS3 achieves a significantly higher task success rate than comparisons on all tasks. We also find that LS3 violates constraints less often than comparisons on the Reacher task, but violates constraints more often than SACfD and SACfD+RRL on the other domains, likely because SACfD and SACfD+RRL spend much less time in the neighborhood of constraint-violating states during training due to their lower task success rates. We find that the AWAC comparison achieves very low task performance. While AWAC is designed for offline reinforcement learning, to the best of our knowledge it has not previously been evaluated on long-horizon, image-based tasks such as those in this paper, which we hypothesize are very challenging for it.
As expected, we find that LS3 has a lower success rate when the Safe Set constraint is removed (LS3 (no Safe Set)). The Safe Set is particularly important in the sequential pushing task, where LS3 (no Safe Set) has a much lower task completion rate than LS3. See the supplement for details on experimental parameters and offline data used for LS3 and comparisons, and for ablations studying the effect of the planning horizon and the threshold used to define the Safe Set.
In physical experiments, we compare LS3 to SACfD and SACfD+RRL (Figure 5) on the physical cable routing task illustrated in Figure 3. We find that LS3 quickly outperforms the suboptimal demonstrations while succeeding at the task significantly more often than both comparisons, which are unable to learn the task and also violate constraints more often than LS3. We hypothesize that the difficulty of reasoning about cable collisions and deformation from images makes it challenging for prior algorithms to make sufficient task progress, as they do not use prior successes to structure exploration. See the supplement for details on experimental parameters and offline data used for LS3 and comparisons. We use data augmentation to increase the size of the dataset used to train the goal and constraint classifiers, taking the collected images and creating an expanded dataset by adding randomly sampled affine translations and perspective shifts.
We present LS3, a scalable algorithm for safe and efficient policy learning for visuomotor tasks. LS3 structures exploration by learning a Safe Set in a learned latent space, which captures the set of states from which the agent is confident in task completion. LS3 then ensures that the agent can plan back to states in the Safe Set, encouraging consistent task completion during learning. Experiments suggest that LS3 is able to use this procedure to safely and efficiently learn 4 visuomotor control tasks, including a challenging sequential pushing task in simulation and a cable routing task on a physical robot. In future work, we are excited to explore further physical evaluation of LS3 on safety-critical visuomotor control tasks such as navigation for home support robots or surgical automation.
In Appendices 7.1 and 7.2 we discuss algorithmic details and implementation/hyperparameter details, respectively, for LS3 and all comparisons. We then provide full details on each of the experimental domains and how data is collected in them in Appendix 7.3. Finally, in Appendix 7.4 we perform sensitivity experiments and ablations.
In this section, we provide implementation details and additional background information for LS3 and comparison algorithms.
We now discuss additional details for each of the components of LS3, including network architectures, training data, and loss functions.
We scale all image inputs to a fixed size before feeding them to the β-VAE, which uses a convolutional neural network for the encoder and a transpose convolutional neural network for the decoder. We use the encoder and decoder from Hafner et al., but modify the second convolutional layer in the encoder to have a stride of 3 rather than 2. As is standard for β-VAEs, we train with a mean-squared error reconstruction loss combined with a KL-divergence loss: for a particular observation, the loss is the squared reconstruction error of the decoded latent plus β times the KL divergence between the encoder's output distribution and a standard Gaussian prior, where the latent is sampled using the reparameterization trick.
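The loss and the reparameterized sampling can be sketched as follows (numpy is used for illustration; in practice an autograd framework carries gradients through `reparameterize`, and the paper's exact β weighting may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I): the reparameterization
    trick, giving a differentiable sampling path for the encoder outputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def beta_vae_loss(recon, target, mu, log_var, beta=1.0):
    """MSE reconstruction plus beta-weighted KL(N(mu, sigma^2) || N(0, I)),
    averaged over the batch."""
    mse = np.mean((recon - target) ** 2)
    kl = -0.5 * np.mean(np.sum(1 + log_var - mu ** 2 - np.exp(log_var),
                               axis=-1))
    return mse + beta * kl
```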
As in Chua et al., we train a probabilistic ensemble of neural networks to learn dynamics. Each network has two hidden layers with 128 hidden units. We train these networks with a maximum log-likelihood objective: for two particular latent states z_t and z_{t+1} and the corresponding action a_t, the loss for a dynamics model with parameters θ is the negative log-likelihood of z_{t+1} under the Gaussian distribution the model predicts from (z_t, a_t).
When using the learned dynamics model for planning, we use the TS-1 method from Chua et al. .
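The per-transition loss and a TS-1-style rollout can be sketched as below. Both functions are illustrative (constants dropped from the NLL; the ensemble interface is an assumption, not the authors' code):

```python
import numpy as np

def gaussian_nll(mu, log_var, z_next):
    """Negative log-likelihood of the observed next latent state under a
    diagonal-Gaussian dynamics head predicting (mu, log_var); additive
    constants are dropped. Each ensemble member minimizes this loss."""
    return 0.5 * np.sum(log_var + (z_next - mu) ** 2 / np.exp(log_var),
                        axis=-1)

def ts1_rollout(ensemble, z0, actions, rng):
    """TS-1-style particle propagation: re-draw which ensemble member
    advances the particle at every timestep."""
    z = z0
    for a in actions:
        member = ensemble[rng.integers(len(ensemble))]
        z = member(z, a)
    return z
```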
As discussed in Section 4.3, we train an ensemble of recursively-defined value functions to predict long-term reward. We represent these functions using fully connected neural networks with 3 hidden layers of 256 hidden units. Similarly to prior work, we use separate training objectives during offline and online training. During offline training, we train the function to predict the actual discounted cost-to-go on all trajectories in the offline dataset: for a latent vector z_t, the offline loss is the squared error between the value prediction and the empirical discounted return from z_t.
Then, in online training, we also store a target network and use it to calculate a temporal difference (TD-1) error, where the target parameters are those of a lagged copy of the value network and the bootstrap action is chosen by the policy at the timestep at which the target network was set. We update the target network every 100 updates. In each of these equations, γ is a discount factor. Because all episodes end by hitting a time horizon, we found it beneficial to remove the terminal-mask multiplier usually used with TD-1 error losses.
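The TD-1 regression targets described above amount to a one-step bootstrap against the lagged target network. A minimal sketch (names are illustrative):

```python
def td1_targets(rewards, next_values, gamma=0.99):
    """TD-1 regression targets r_t + gamma * V_target(z_{t+1}).

    `next_values` are lagged target-network value estimates at the next
    latent state. No terminal mask is applied, matching the note above that
    episodes end only by hitting the time horizon.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]
```

The online value loss is then the squared error between the current value predictions and these targets.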
For all simulated experiments we update value functions using only data collected by the suboptimal demonstrator or collected online, ignoring offline data collected with random interactions or offline demonstrations of constraint violating behavior.
We represent the constraint indicator with a neural network with 3 hidden layers of 256 hidden units each, trained with a binary cross-entropy loss using constraint-violating transitions as unsafe examples and constraint-satisfying states as safe examples. Similarly, we represent the goal estimator with a neural network with 3 hidden layers of 256 hidden units. This estimator is also trained with a binary cross-entropy loss, with positive examples drawn from states in the goal set and negative examples sampled from all datasets. For the constraint estimator and goal indicator, training data is sampled uniformly from a replay buffer containing all offline and online data.
The Safe Set classifier is represented with a neural network with 3 hidden layers of 256 hidden units. We train the Safe Set classifier using a binary cross-entropy loss to predict a discounted relaxation of an indicator of whether a state is part of a successful trajectory. Training data is sampled uniformly from a replay buffer containing all collected data.
The CEM optimizer samples candidate action sequences from a diagonal Gaussian distribution, simulates each one multiple times over the learned dynamics, and refits the parameters of the Gaussian on the trajectories with the highest score under equation 2, where constraints are implemented by assigning large negative rewards to trajectories that violate either the Safe Set constraint or user-specified constraints. This process is repeated for a fixed number of iterations to iteratively refine the set of sampled trajectories to optimize equation 2. To improve the optimizer's efficiency on tasks where subsequent actions are often correlated, we sample a proportion of the optimizer's candidates at the first iteration from the distribution it learned when planning the previous action. To avoid local minima, we sample a further proportion uniformly from the action space. See Chua et al. for more details on the cross-entropy method as applied to planning over neural network dynamics models.
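The sample-score-refit loop above can be sketched generically as follows (a standard CEM skeleton with assumed hyperparameters, not the authors' implementation; constraint penalties would be folded into `score_fn`):

```python
import numpy as np

def cem_plan(score_fn, horizon, act_dim, n_samples=64, n_elite=8, n_iters=3,
             rng=None):
    """Cross-entropy method over action sequences: sample from a diagonal
    Gaussian, keep the elites under score_fn, refit mean/std, and repeat.
    Returns the first action of the refined plan."""
    rng = rng or np.random.default_rng(0)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        cands = mean + std * rng.standard_normal((n_samples, horizon, act_dim))
        scores = np.array([score_fn(c) for c in cands])
        elites = cands[np.argsort(scores)[-n_elite:]]   # best-scoring plans
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]
```

Warm-starting from the previous timestep's distribution and mixing in uniform samples, as described above, would modify how `mean`, `std`, and the first batch of candidates are initialized.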
As mentioned in Section 4.4, we set the Safe Set classifier threshold adaptively by checking whether there exists at least one plan that satisfies the Safe Set constraint at each CEM iteration. If no such plan exists, we multiply the threshold by a decay factor and re-initialize the optimizer at the first CEM iteration with the new threshold. This ensures that the threshold is set such that planning back to the Safe Set is possible.
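A sketch of this relaxation loop, assuming a feasibility check over the current CEM candidates; the initial threshold, decay factor, and floor below are placeholders, not the paper's tuned constants:

```python
def adaptive_safe_set_threshold(feasible_exists, delta0=0.8, decay=0.8,
                                min_delta=1e-3):
    """Relax the Safe Set threshold until at least one plan is feasible.

    `feasible_exists(delta)` stands in for checking whether any sampled
    action sequence keeps its terminal state inside the Safe Set at
    classifier threshold `delta`.
    """
    delta = delta0
    while not feasible_exists(delta) and delta > min_delta:
        delta *= decay  # loosen the constraint and re-run the optimizer
    return delta

# Example: suppose plans only pass the Safe Set check once delta <= 0.5;
# the threshold decays 0.8 -> 0.64 -> 0.512 -> 0.4096 and then stops.
delta = adaptive_safe_set_threshold(lambda d: d <= 0.5)
```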
We utilize an existing implementation of the Soft Actor-Critic algorithm and initialize the actor and critic from demonstrations, keeping all other hyperparameters the same as the defaults in the provided implementation. We create a new dataset using only data from the suboptimal demonstrator, and use this data to behavior clone the actor and initialize the critic with offline Bellman backups. We use the same mean-squared behavior cloning loss as for the behavior cloning policy, but only train the mean of the SAC policy. Precisely, for a policy $\pi_\theta$ with parameters $\theta$ we use the loss $\mathcal{L}(\theta) = \sum_i \sum_t \lVert \pi_\theta(s_t^i) - a_t^i \rVert_2^2$, where $s_t^i$ and $a_t^i$ are the state and action at timestep $t$ of trajectory $i$. We also experimented with training the SAC critic on all data provided to LS3, but found that this hurt performance. We update neural network weights using the Adam optimizer. The only hyperparameter for SACfD that we tuned across environments was the reward penalty imposed upon constraint violations; for all simulation experiments, we evaluated several penalty values and report the highest performing one, using a single value for all experiments except the reacher task, for which a different value performed best. We observed that higher penalty values resulted in worse task performance without a significant increase in constraint satisfaction. We hypothesize that since the agent is frozen in the environment upon a constraint violation, the resulting loss of reward is sufficient to enable SACfD to avoid constraint violations.
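The behavior-cloning loss above can be written as a short sketch; the toy states, actions, and the `identity_policy` stand-in are illustrative, not the actual SAC actor:

```python
import numpy as np

def bc_loss(policy_mean, states, actions):
    """Mean-squared behavior-cloning loss: sum over demo transitions of
    ||pi(s) - a||^2. Only the policy mean is trained, matching the SAC
    initialization described above."""
    preds = np.stack([policy_mean(s) for s in states])
    return float(np.sum((preds - actions) ** 2))

# Hypothetical demo batch with a 2-D action space.
states = np.zeros((4, 3))               # placeholder demo states
actions = np.ones((4, 2))               # demonstrator actions
match_policy = lambda s: np.ones(2)     # a policy that matches the demos
loss = bc_loss(match_policy, states, actions)  # = 0.0 for a perfect clone
```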
We build on an existing implementation of the Recovery RL algorithm. We train the safety critic on all offline data. Recovery RL uses SACfD as its task policy optimization algorithm and introduces two new hyperparameters. For each environment, we evaluated SACfD+RRL across 3-4 hyperparameter settings and report results from the highest performing run.
To provide a comparison to state-of-the-art offline reinforcement learning algorithms, we evaluate AWAC on the experimental domains in this work, using an existing implementation. For all simulation experiments, we evaluated several settings, report the highest performing one, and use the same setting across all experiments. We used the implementation's default settings for all other hyperparameters.
| Parameter | Navigation | Reacher | Sequential Pushing | Cable Routing |
| --- | --- | --- | --- | --- |
In Table 3, we present the hyperparameters used to train and run LS3. We give the constraint and Safe Set thresholds, the planning horizon, and the VAE KL regularization weight. We also give the number of particles sampled over the probabilistic latent dynamics model for a fixed action sequence, which is used to estimate the probability of constraint satisfaction and the expected reward. For the cross-entropy method, we sample a population of action sequences at each iteration, take the best-scoring sequences, and refit the sampling distribution; this process repeats for a fixed number of iterations. We also report the latent space dimension, whether frame stacking is used as input, the training batch size, and the discount factor. Finally, we present values of the Safe Set Bellman coefficient. For all domains, we rescale RGB observations to a fixed resolution. For all modules we use the Adam optimizer; the dynamics model uses a separate learning rate.
The visual navigation domain has 2-D single-integrator dynamics with additive Gaussian noise. The start position is fixed and the goal set is a Euclidean ball around the goal position. Demonstrations are created by guiding the agent north for 20 timesteps, east for 40 timesteps, and then directly towards the goal until the episode terminates. This tuned controller ensures that demonstrations avoid the obstacle and reach the goal set, but they are very suboptimal. To collect demonstrations of constraint-violating behavior, we randomly sample starting points throughout the environment, move in a random direction for 15 timesteps, and then move directly towards the obstacle. We do not collect additional random data in this environment. We collect 50 demonstrations of successful behavior and 50 trajectories containing constraint-violating behavior.
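The scripted demonstrator above can be sketched as follows; the start, goal, step size, and horizon are placeholder values, not the environment's actual coordinates:

```python
import numpy as np

def scripted_demo(start, goal, horizon=100, step=1.0):
    """Suboptimal demonstrator sketch: 20 steps north, 40 steps east,
    then head straight for the goal until reaching it."""
    goal = np.asarray(goal, dtype=float)
    traj = [np.asarray(start, dtype=float)]
    for t in range(horizon):
        pos = traj[-1]
        if t < 20:
            a = np.array([0.0, step])            # north
        elif t < 60:
            a = np.array([step, 0.0])            # east
        else:
            d = goal - pos
            n = np.linalg.norm(d)
            a = step * d / n if n > step else d  # toward the goal
        traj.append(pos + a)
        if np.linalg.norm(traj[-1] - goal) < 1e-6:
            break                                # reached the goal set
    return np.array(traj)

traj = scripted_demo(start=[0.0, 0.0], goal=[50.0, 30.0])
```

With these placeholder coordinates the path detours well around where an obstacle would sit before converging, mirroring the safe-but-suboptimal demonstrations described above.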
The reacher domain is built on the reacher environment provided in the DeepMind Control Suite. The robot is a planar 2-link arm, and the agent supplies torques to each of the 2 joints. Because velocity is not observable from a single frame, algorithms are provided with several stacked frames as input. The start position of the end effector is fixed, and the objective is to navigate the end effector to a fixed goal set on the top left of the workspace without allowing it to enter a large red stay-out zone. To collect constraint-violating data,
we randomly sample starting states in the environment and then use a PID controller to move towards the constraint. To sample random data that requires the agent to model velocity for accurate prediction, we start trajectories at random locations in the environment and sample each action from a normal distribution centered on the previous action. We collect 50 demonstrations of successful behavior, 50 trajectories containing constraint violations, and 100 short trajectories of random data.
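The temporally correlated action sampling described above can be sketched as below; the noise scale, horizon, and action bounds are assumed values:

```python
import numpy as np

def correlated_actions(horizon, act_dim, sigma=0.2,
                       rng=np.random.default_rng(0)):
    """Random-data collection sketch: each action is drawn from a normal
    distribution centered on the previous action, producing the smooth,
    velocity-exciting trajectories described above."""
    actions = np.zeros((horizon, act_dim))
    a = np.zeros(act_dim)
    for t in range(horizon):
        a = rng.normal(loc=a, scale=sigma)   # center on previous action
        actions[t] = np.clip(a, -1.0, 1.0)   # keep within action bounds
        a = actions[t]
    return actions

acts = correlated_actions(horizon=50, act_dim=2)
```

Because consecutive actions differ only by small Gaussian perturbations, the resulting motion sustains nonzero velocities, which is what forces the dynamics model to encode velocity from stacked frames.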
This sequential pushing environment is implemented in MuJoCo, and the robot specifies a desired planar displacement for its end effector. The goal is to push all 3 blocks backwards by at least some displacement on the table; constraints are violated if blocks are pushed backwards off the table. Demonstrations are created by guiding the end effector to the center of each block and then moving it in a straight line at low velocity until the block is in the goal set; this process is repeated for each of the 3 blocks. Data of constraint violations and random transitions are collected by randomly switching between a policy that moves towards the blocks and a policy that samples randomly from the action space. We collect 500 demonstrations of successful behavior and 300 trajectories of random and/or constraint-violating behavior.
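The switching data-collection policy can be sketched as follows, assuming a scripted "move toward block" action and a switching probability; both are illustrative placeholders:

```python
import numpy as np

def switching_rollout(horizon, push_action, p_switch=0.1,
                      rng=np.random.default_rng(0)):
    """Sketch of mixed data collection: at each step, with probability
    `p_switch` toggle between a scripted push action and uniformly
    random actions, yielding both near-constraint and random data."""
    actions, scripted = [], True
    for _ in range(horizon):
        if rng.random() < p_switch:
            scripted = not scripted          # switch collection mode
        a = push_action if scripted else rng.uniform(-1.0, 1.0, size=2)
        actions.append(np.asarray(a, dtype=float))
    return np.array(actions)

acts = switching_rollout(horizon=30, push_action=np.array([0.0, -0.5]))
```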
This task starts with the robot grasping one endpoint of the red cable, which it can manipulate through end effector motions. The goal is to guide the red cable to intersect the green goal set while avoiding the blue obstacle. Ground-truth goal and obstacle checks are performed with color masking. LS3 and all baselines are provided with a segmentation mask of the cable as input. The demonstrator generates trajectories by moving the end effector well over the obstacle and to the right before executing a straight-line trajectory to the goal set. This ensures the obstacle is avoided with significant margin, but the demonstrations may not be optimal for the task. Random trajectories are collected by following a demonstrator trajectory for a random amount of time and then sampling from the action space until the episode reaches the time horizon. We collect 420 demonstrations of successful behavior and 150 random trajectories.
Key hyperparameters in LS3 are the constraint threshold and Safe Set threshold, which control whether the agent decides predicted states are constraint-violating or in the Safe Set, respectively. We ablate these parameters for the sequential pushing environment in Figures 6 and 8. We find that lower constraint thresholds make the agent less likely to violate constraints, as expected. Additionally, we find that higher Safe Set thresholds help constrain exploration more effectively, but too high a threshold leads to poor performance as the agent exploits local maxima in the Safe Set estimation. Finally, we ablate the planning horizon and find that when the horizon is too long, LS3 can explore too aggressively away from the Safe Set, leading to poor performance. With a shorter horizon, LS3 explores much more stably, but if the horizon is too short, LS3 is eventually unable to explore significantly new plans, while slightly increasing it allows for continuous improvement in performance.