Closing the Sim-to-Real Loop: Adapting Simulation Randomization with Real World Experience

10/12/2018 ∙ by Yevgen Chebotar, et al. ∙ University of Southern California

We consider the problem of transferring policies to the real world by training on a distribution of simulated scenarios. Rather than manually tuning the randomization of simulations, we adapt the simulation parameter distribution using a few real world roll-outs interleaved with policy training. In doing so, we are able to change the distribution of simulations to improve the policy transfer by matching the policy behavior in simulation and the real world. We show that policies trained with our method are able to reliably transfer to different robots in two real world tasks: swing-peg-in-hole and opening a cabinet drawer. The video of our experiments can be found at https://sites.google.com/view/simopt

I Introduction

Learning continuous control in complex real world environments has attracted wide interest in the past few years, in particular learning policies in simulators and transferring them to the real world, as we still struggle to find ways to acquire the necessary amount of experience and data in the real world directly. While there have been recent attempts at learning by collecting large scale data directly on real robots [1, 2, 3, 4], such an approach remains challenging, as collecting real world data is prohibitively laborious and expensive. Simulators offer several advantages, e.g. they can run faster than real-time and allow for acquiring a large diversity of training data. However, due to imprecise simulation models and the lack of high fidelity replication of real world scenes, policies learned in simulation often cannot be directly applied to real-world systems, a phenomenon also known as the reality gap [5]. In this work, we focus on closing the reality gap by learning policies on distributions of simulated scenarios that are optimized for better policy transfer.

Training policies on a large diversity of simulated scenarios by randomizing relevant parameters, also known as domain randomization, has shown considerable promise for real world transfer in a range of recent works [6, 7, 8, 9]. However, designing appropriate simulation parameter distributions remains a tedious task and often requires substantial expert knowledge. Moreover, there are no guarantees that the applied randomization would actually lead to a sensible real world policy, as the design choices made in randomizing the parameters tend to be somewhat biased by the expertise of the practitioner. In this work, we apply a data-driven approach and use real world data to adapt simulation randomization such that the behavior of the policies trained in simulation better matches their behavior in the real world. Starting with some initial distribution of the simulation parameters, we can therefore perform learning in simulation and use real world roll-outs of learned policies to gradually change the simulation randomization such that the learned policies transfer better to the real world without requiring the exact replication of the real world scene in simulation. This approach falls into the domain of model-based reinforcement learning. However, we leverage recent developments in physics simulations to provide a strong prior of the world model in order to accelerate the learning process. Our system uses partial observations of the real world and only needs to compute rewards in simulation, therefore lifting the requirement for full state knowledge or reward instrumentation in the real world.

II Related Work

The problem of finding accurate models of the robot and the environment that can facilitate the design of robotic controllers in the real world dates back to the original works on system identification [10]. In the context of reinforcement learning (RL), model-based RL has explored optimizing policies using learned models [11]. In [12, 13], the data from real-world policy executions is used to fit a probabilistic dynamics model, which is then used for learning an optimal policy. Although our work follows the general principle of model-based reinforcement learning, we aim at using a simulation engine as a form of parameterized model that can help us to embed prior knowledge about the world.

Overcoming the discrepancy between simulated models and the real world has been addressed through identifying simulation parameters [14], finding common feature representations of real and synthetic data [15], using generative models to make synthetic images more realistic [16], fine-tuning the policies trained in simulation in the real world [17], learning inverse dynamics models [18], multi-objective optimization of task fitness and transferability to the real world [19], training on ensembles of dynamics models [20] and training on a large variety of simulated scenarios [6]. Domain randomization of textures was used in [7] to learn to fly a real quadcopter by training an image based policy entirely in simulation. Peng et al. [21] use randomization of physical parameters of the scene to learn a policy in simulation and transfer it to a real robot for pushing a puck to a target position. In [9], randomization of physical properties and object appearance is used to train a dexterous robotic hand to perform in-hand manipulation. Yu et al. [22] propose to not only train a policy on a distribution of simulated parameters, but also learn a component that predicts the system parameters from the current states and actions, and use the prediction as an additional input to the policy. In [23], an upper confidence bound on the estimated simulation optimization bias is used as a stopping criterion for a robust training with domain randomization. In [24], an auxiliary reward is used to encourage policies trained in source and target environments to visit the same states.

A combination of system identification and dynamics randomization has been used in the past to learn locomotion for a real quadruped [25], non-prehensile object manipulation [26] and in-hand object pivoting [27]. In our work, we recognize domain randomization and system identification as powerful tools for training general policies in simulation. However, we address the problem of automatically learning simulation parameter distributions that improve policy transfer, as doing so manually remains challenging. Furthermore, as also noted in [28], simulators have the advantage of providing the full state of the system compared to partial observations of the real world, which is also used in our work for designing better reward functions.

The closest to our approach are the methods from [29, 30, 31, 32, 33] that propose to iteratively learn simulation parameters and train policies. In [29], an iterative system identification framework is used to optimize trajectories of a bipedal robot in simulation and calibrate the simulation parameters by minimizing the discrepancy between the real world and simulated execution of the trajectories. Although we also use the real world data to compute the discrepancy of the simulated executions, we are able to use partial observations of the real world instead of the full states, and we concentrate on learning general policies by finding a simulation parameter distribution that leads to a better transfer without the need for exact replication of the real world environment. [30] suggests optimizing the simulation parameters such that the value function is well approximated in simulation without replicating the real world dynamics. We also recognize that exact replication of the real world dynamics might not be feasible; however, a suitable randomization of the simulated scenarios can still lead to a successful policy transfer. In addition, our approach does not require estimating the reward in the real world, which might be challenging if some of the reward components cannot be observed. [31] and [32] consider grounding the simulator using real world data. However, [31] requires a human in the loop to select the best simulation parameters, and [32] needs to fit additional models for the real robot forward dynamics and simulator inverse dynamics. Finally, our work is closest to the adaptive EPOpt framework of Rajeswaran et al. [33], which optimizes a policy over an ensemble of models and adapts the model distribution using data from the target domain. EPOpt optimizes a risk-sensitive objective to obtain robust policies, whereas we optimize the average performance, which is a risk-neutral objective. Additionally, EPOpt updates the model distribution by employing Bayesian inference with a particle filter, whereas we update the model distribution using an iterative KL-divergence constrained procedure. More importantly, they focus on simulated environments, while in our work we develop an approach that is shown to work in the real world and apply it to two real robot tasks.

III Closing the Sim-to-Real Loop

III-A Simulation randomization

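Let $(\mathcal{S}, \mathcal{A}, P, R, p_0, \gamma, T)$ be a finite-horizon Markov Decision Process (MDP), where $\mathcal{S}$ and $\mathcal{A}$ are state and action spaces, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}_{+}$ is a state-transition probability function or probabilistic system dynamics, $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ a reward function, $p_0: \mathcal{S} \rightarrow \mathbb{R}_{+}$ an initial state distribution, $\gamma$ a reward discount factor, and $T$ a fixed horizon. Let $\tau = (s_0, a_0, \ldots, s_T, a_T)$ be a trajectory of states and actions and $R(\tau) = \sum_{t=0}^{T} \gamma^t R(s_t, a_t)$ the trajectory reward. The goal of reinforcement learning methods is to find parameters $\theta$ of a policy $\pi_\theta(a|s)$ that maximize the expected discounted reward over trajectories induced by the policy: $\mathbb{E}_{\pi_\theta}[R(\tau)]$, where $s_0 \sim p_0$, $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$ and $a_t \sim \pi_\theta(a_t|s_t)$.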
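The trajectory reward defined above is a discounted sum of per-step rewards over the fixed horizon. A minimal sketch, assuming the per-step rewards are already available as a list (the function name and the example numbers are hypothetical, not taken from the paper):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over one finite-horizon trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Hypothetical three-step trajectory with unit reward at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 1.0 + 0.5 + 0.25 = 1.75
```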

In our work, the system dynamics are either induced by a simulation engine or the real world. As the simulation engine itself is deterministic, a reparameterization trick [34] can be applied to introduce probabilistic dynamics. In particular, we define a distribution of simulation parameters $\xi \sim p_\phi(\xi)$ parameterized by $\phi$. The resulting probabilistic system dynamics of the simulation engine are $P_{\xi \sim p_\phi} = P(s_{t+1}|s_t, a_t, \xi)$.
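The idea in this paragraph, a deterministic simulator that becomes stochastic only through sampled parameters, can be sketched as follows. The point-mass dynamics, the parameter names (mass, friction) and all numbers are hypothetical stand-ins chosen for illustration; they are not the simulator or parameters used in the paper.

```python
import numpy as np


def simulator_step(state, action, xi):
    """Hypothetical deterministic dynamics: a 1-D point mass with friction.

    xi = (mass, friction). Identical inputs always give identical outputs.
    """
    mass, friction = xi
    pos, vel = state
    dt = 0.01
    acc = (action - friction * vel) / mass
    vel = vel + dt * acc
    return np.array([pos + dt * vel, vel])


def sample_parameters(mean, cov, rng):
    """Draw simulation parameters from a Gaussian with the given mean and covariance."""
    return rng.multivariate_normal(mean, cov)


rng = np.random.default_rng(0)
mean = np.array([1.0, 0.1])      # assumed nominal mass and friction
cov = np.diag([0.01, 0.0001])    # assumed variances

state = np.array([0.0, 0.0])
# Randomness enters only through the sampled parameters, not through the step itself.
next_states = [
    simulator_step(state, 1.0, sample_parameters(mean, cov, rng)) for _ in range(5)
]
```

Calling `simulator_step` twice with the same parameters returns the same state, while the five entries of `next_states` differ because each uses a freshly sampled parameter vector.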

As shown in [6, 7, 9], it is possible to design a distribution of simulation parameters $p_\phi(\xi)$ such that a policy trained on $P_{\xi \sim p_\phi}$ performs well on a real-world dynamics distribution. This approach is also known as domain randomization, and the policy training maximizes the expected reward under the dynamics induced by the distribution of simulation parameters $p_\phi(\xi)$:

$$\max_{\theta} \; \mathbb{E}_{P_{\xi \sim p_\phi}} \left[ \mathbb{E}_{\pi_\theta} \left[ R(\tau) \right] \right] \tag{1}$$

Domain randomization requires significant expertise and tedious manual fine-tuning to design the simulation parameter distribution $p_\phi(\xi)$. Furthermore, as we show in our experiments, it is often disadvantageous to use overly wide distributions of simulation parameters, as they can include scenarios with infeasible solutions that hinder successful policy learning or lead to exceedingly conservative policies. Instead, in the next section, we present a way to automate the learning of $p_\phi(\xi)$ that makes it possible to shape a suitable randomization without the need to train on very wide distributions.

III-B Learning simulation randomization

Fig. 1: The pipeline for optimizing the simulation parameter distribution. After training a policy on the current distribution, we sample the policy both in the real world and for a range of parameters in simulation. The discrepancy between the simulated and real observations is used to update the simulation parameter distribution in SimOpt.

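The goal of our framework is to find a distribution of simulation parameters that brings observations or partial observations induced by the policy trained under this distribution closer to the observations of the real world. Let $\pi_{\theta, p_\phi}$ be a policy trained under the simulated dynamics distribution $P_{\xi \sim p_\phi}$ as in the objective (1), and let $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$ be a measure of discrepancy between real world observation trajectories $\tau^{ob}_{real} = (o_{0,real}, \ldots, o_{T,real})$ and simulated observation trajectories $\tau^{ob}_{\xi} = (o_{0,\xi}, \ldots, o_{T,\xi})$ sampled using policy $\pi_{\theta, p_\phi}$ and the dynamics distribution $P_{\xi \sim p_\phi}$. It should be noted that the inputs of the policy $\pi_{\theta, p_\phi}$ and the observations used to compute $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$ are not required to be the same. The goal of optimizing the simulation parameter distribution is to minimize the following objective:

$$\min_{\phi} \; \mathbb{E}_{P_{\xi \sim p_\phi}} \left[ \mathbb{E}_{\pi_{\theta, p_\phi}} \left[ D(\tau^{ob}_{\xi}, \tau^{ob}_{real}) \right] \right] \tag{2}$$

This optimization would entail training and real robot evaluation of the policy $\pi_{\theta, p_\phi}$ for each $\phi$. This would require a large number of RL iterations and, more critically, real robot trials. Hence, we develop an iterative approach to approximate the optimization by training a policy $\pi_{\theta, p_{\phi_i}}$ on the simulation parameter distribution from the previous iteration and using it both for sampling the real world observations and for optimizing the new simulation parameter distribution $p_{\phi_{i+1}}$:

$$\min_{\phi_{i+1}} \; \mathbb{E}_{P_{\xi_{i+1} \sim p_{\phi_{i+1}}}} \left[ \mathbb{E}_{\pi_{\theta, p_{\phi_i}}} \left[ D(\tau^{ob}_{\xi_{i+1}}, \tau^{ob}_{real}) \right] \right] \quad \text{s.t.} \quad D_{KL}\left( p_{\phi_{i+1}} \,\|\, p_{\phi_i} \right) \leq \epsilon \tag{3}$$

where we introduce a KL-divergence step $\epsilon$ between the old simulation parameter distribution $p_{\phi_i}$ and the updated distribution $p_{\phi_{i+1}}$ to avoid going out of the trust region of the policy $\pi_{\theta, p_{\phi_i}}$ trained on the old simulation parameter distribution. Fig. 1 shows the general structure of our algorithm that we call SimOpt.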
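For Gaussian parameter distributions, the KL-divergence step between the updated and the old distribution has a closed form, which makes the trust-region check cheap to evaluate. The sketch below uses the standard formula for two multivariate Gaussians; the function name and the direction of the divergence (updated relative to old) are assumptions made for illustration.

```python
import numpy as np


def gaussian_kl(mu_new, cov_new, mu_old, cov_old):
    """KL( N(mu_new, cov_new) || N(mu_old, cov_old) ) for full covariance matrices."""
    k = mu_new.shape[0]
    cov_old_inv = np.linalg.inv(cov_old)
    diff = mu_old - mu_new
    trace_term = np.trace(cov_old_inv @ cov_new)
    maha_term = diff @ cov_old_inv @ diff
    _, logdet_new = np.linalg.slogdet(cov_new)
    _, logdet_old = np.linalg.slogdet(cov_old)
    return 0.5 * (trace_term + maha_term - k + logdet_old - logdet_new)


# Hypothetical check: shifting the mean of a unit Gaussian by 0.1 along one axis.
kl = gaussian_kl(np.array([0.1, 0.0]), np.eye(2), np.zeros(2), np.eye(2))
print(kl)  # 0.5 * 0.1**2, i.e. approximately 0.005
```

An update would be accepted only if this value stays below the chosen step size.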

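Algorithm 1: SimOpt framework
$p_{\phi_0}$ ← initial simulation parameter distribution
$\epsilon$ ← KL-divergence step for updating $p_\phi$
for iteration $i \in \{0, \ldots, N\}$ do
    env ← Simulation($p_{\phi_i}$)
    $\pi_{\theta, p_{\phi_i}}$ ← RL(env)
    $\tau^{ob}_{real}$ ∼ RealRollout($\pi_{\theta, p_{\phi_i}}$)
    $\xi$ ∼ Sample($p_{\phi_i}$)
    $\tau^{ob}_{\xi}$ ∼ SimRollout($\pi_{\theta, p_{\phi_i}}$, $\xi$)
    $c(\xi)$ ← $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$
    $p_{\phi_{i+1}}$ ← UpdateDistribution($p_{\phi_i}$, $\xi$, $c(\xi)$, $\epsilon$)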
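The control flow of the pipeline in Fig. 1 (train, roll out in the real world, sample simulations, score them, update the distribution) can be sketched as a plain loop. Everything below is a toy under stated assumptions: the callables are hypothetical stand-ins supplied by the caller, the parameter is one-dimensional, and the update is a simple elite-sample refit rather than the KL-constrained update used in the paper.

```python
import numpy as np


def simopt_loop(mean, std, train_policy, real_rollout, sim_rollout,
                discrepancy, update_distribution, n_iterations, n_samples, rng):
    """Skeleton of the iterative loop; every callable is a user-supplied stand-in."""
    for _ in range(n_iterations):
        policy = train_policy(mean, std)               # RL on the current distribution
        real_obs = real_rollout(policy)                # a few real-world roll-outs
        xis = rng.normal(mean, std, size=n_samples)    # sample simulation parameters
        costs = np.array(
            [discrepancy(sim_rollout(policy, xi), real_obs) for xi in xis]
        )
        mean, std = update_distribution(mean, std, xis, costs)
    return mean, std


def elite_refit(mean, std, xis, costs, elite_fraction=0.2):
    """Toy update (assumption): refit mean and std to the lowest-cost samples."""
    n_elite = max(2, int(elite_fraction * len(xis)))
    elite = xis[np.argsort(costs)[:n_elite]]
    return float(elite.mean()), float(elite.std())


TRUE_XI = 0.5  # hidden "real world" parameter of the toy problem

final_mean, final_std = simopt_loop(
    mean=0.0, std=1.0,
    train_policy=lambda mean, std: None,        # no learning in this toy
    real_rollout=lambda policy: TRUE_XI,        # the observation reveals the parameter
    sim_rollout=lambda policy, xi: xi,
    discrepancy=lambda sim_obs, real_obs: (sim_obs - real_obs) ** 2,
    update_distribution=elite_refit,
    n_iterations=5, n_samples=200,
    rng=np.random.default_rng(0),
)
```

In this toy the distribution mean moves from 0.0 toward the hidden value 0.5 while its spread shrinks; the real components (policy training, robot roll-outs, simulator) would replace the lambdas.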

III-C Implementation

Here we describe particular implementation choices for the components of our framework used in this work. However, it should be noted that each of the components is replaceable. Algorithm 1 describes the order of running all the components in our implementation.

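The RL training is performed on a GPU based simulator using a parallelized version of proximal policy optimization (PPO) [35] on a multi-GPU cluster [36]. We parameterize our simulation parameter distribution as a Gaussian, i.e. $p_\phi(\xi) \sim \mathcal{N}(\mu, \Sigma)$ with $\phi = (\mu, \Sigma)$. We choose weighted $\ell_1$ and $\ell_2$ norms between simulation and real world observations for our observation discrepancy function $D$:

$$D(\tau^{ob}_{\xi}, \tau^{ob}_{real}) = w_{\ell_1} \sum_{i=0}^{T} \left\| W (o_{i,\xi} - o_{i,real}) \right\|_1 + w_{\ell_2} \sum_{i=0}^{T} \left\| W (o_{i,\xi} - o_{i,real}) \right\|_2^2 \tag{4}$$

where $w_{\ell_1}$ and $w_{\ell_2}$ are the weights of the $\ell_1$ and $\ell_2$ norms, and $W$ are the importance weights for each observation dimension. We additionally apply a Gaussian filter to the distance computation to account for misalignments of the trajectories.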
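A discrepancy of this kind, a weighted sum of an $\ell_1$ and a squared $\ell_2$ term over per-dimension weighted observation differences, can be sketched with NumPy. How exactly the Gaussian filter enters the computation is not spelled out in the text; smoothing the weighted difference signal over time is one possible reading and is an assumption here, as are the function names, the interpretation of the truncation as a multiple of the standard deviation, and the example weights.

```python
import numpy as np


def gaussian_kernel(sigma, truncate):
    """Normalized 1-D Gaussian kernel covering +/- truncate * sigma timesteps."""
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def discrepancy(sim_obs, real_obs, dim_weights, w_l1, w_l2, sigma=5.0, truncate=4.0):
    """Weighted l1 + squared-l2 distance between two (T, obs_dim) trajectories.

    Assumes T is at least the kernel length so that mode="same" keeps T samples.
    """
    diff = (sim_obs - real_obs) * dim_weights   # per-dimension importance weights
    kernel = gaussian_kernel(sigma, truncate)
    # Smooth each observation dimension over time to tolerate small misalignments.
    smoothed = np.stack(
        [np.convolve(diff[:, d], kernel, mode="same") for d in range(diff.shape[1])],
        axis=1,
    )
    l1_term = np.abs(smoothed).sum()
    l2_term = (smoothed ** 2).sum()
    return float(w_l1 * l1_term + w_l2 * l2_term)


rng = np.random.default_rng(0)
real = rng.normal(size=(150, 3))                 # hypothetical real trajectory
weights = np.array([0.05, 1.0, 1.0])             # hypothetical per-dimension weights
cost_same = discrepancy(real, real, weights, w_l1=0.5, w_l2=1.0)
cost_close = discrepancy(real + 0.1, real, weights, w_l1=0.5, w_l2=1.0)
cost_far = discrepancy(real + 1.0, real, weights, w_l1=0.5, w_l2=1.0)
```

Identical trajectories give zero cost, and the cost grows as the simulated trajectory drifts away from the real one.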

As we use a non-differentiable simulator, we employ a sampling-based gradient-free algorithm based on relative entropy policy search [37] for optimizing the objective (3), which is able to perform updates of $p_\phi$ with an upper bound on the KL-divergence step. By doing so, the simulator can be treated as a black box, as in this case $p_\phi$ can be optimized directly by only using samples $\xi \sim p_\phi$ and the corresponding costs $c(\xi)$ coming from $D(\tau^{ob}_{\xi}, \tau^{ob}_{real})$. Sampling of simulation parameters and the corresponding policy roll-outs is highly parallelizable, which we use in our experiments to evaluate large numbers of simulation parameter samples.
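The black-box character of such an update can be illustrated with a simplified stand-in: weight each sample by an exponential of its negative cost, refit a Gaussian to the weighted samples, and raise the temperature until the refit stays within the KL bound. This is not the dual optimization of relative entropy policy search; the temperature doubling, the jitter term and all names are assumptions made for this sketch.

```python
import numpy as np


def gaussian_kl(mu_new, cov_new, mu_old, cov_old):
    """KL( N(mu_new, cov_new) || N(mu_old, cov_old) )."""
    k = mu_new.shape[0]
    inv_old = np.linalg.inv(cov_old)
    diff = mu_old - mu_new
    _, logdet_new = np.linalg.slogdet(cov_new)
    _, logdet_old = np.linalg.slogdet(cov_old)
    return 0.5 * (np.trace(inv_old @ cov_new) + diff @ inv_old @ diff
                  - k + logdet_old - logdet_new)


def weighted_gaussian_fit(samples, weights):
    """Weighted maximum-likelihood mean and covariance (weights sum to one)."""
    mu = weights @ samples
    centered = samples - mu
    cov = (centered * weights[:, None]).T @ centered
    return mu, cov + 1e-9 * np.eye(samples.shape[1])  # jitter keeps cov invertible


def kl_bounded_update(mu_old, cov_old, samples, costs, epsilon, min_temperature=1e-3):
    """Greedy-to-cautious search: double the temperature until KL <= epsilon."""
    shifted = costs - costs.min()   # numerical stability; weights are unchanged
    temperature = min_temperature
    for _ in range(60):
        weights = np.exp(-shifted / temperature)
        weights /= weights.sum()
        mu_new, cov_new = weighted_gaussian_fit(samples, weights)
        if gaussian_kl(mu_new, cov_new, mu_old, cov_old) <= epsilon:
            return mu_new, cov_new
        temperature *= 2.0
    return mu_old, cov_old          # no admissible update found


rng = np.random.default_rng(0)
mu_old, cov_old = np.zeros(2), np.eye(2)
target = np.array([1.0, 1.0])                        # hypothetical low-cost region
samples = rng.multivariate_normal(mu_old, cov_old, size=2000)
costs = np.sum((samples - target) ** 2, axis=1)      # only samples and costs are used
mu_new, cov_new = kl_bounded_update(mu_old, cov_old, samples, costs, epsilon=0.5)
```

The update only ever sees parameter samples and their scalar costs, so the simulator that produced the costs never needs to be differentiated.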

As noted above, single components of our framework can be exchanged. If a differentiable simulator is available, the objective (3) can be defined as a loss function for optimizing with gradient descent. Furthermore, for cases where $\ell_1$ and $\ell_2$ norms are not applicable, we can employ other forms of discrepancy functions, e.g. to account for potential domain shifts between observations [15, 38, 39]. Alternatively, real world and simulation data can additionally be used to train $D$ to discriminate between the observations by minimizing the prediction loss of classifying observations as simulated or real, similar to the discriminator training in the generative adversarial framework [40, 41, 42]. Finally, a higher-dimensional generative model can be employed to provide a multi-modal randomization of the simulated environments.

IV Experiments

In our experiments we aim at answering the following questions: (1) How does our method compare to pure domain randomization? (2) How does learning a simulation parameter distribution compare to training on a very wide parameter distribution? (3) How many SimOpt iterations and real world trials are required for a successful transfer of robotic manipulation policies? (4) Does our method work for different real world tasks and robots?

We start by performing an ablation study in simulation by transferring policies between scenes with different initial state distributions, such as different poses of the cabinet in the drawer opening task. We demonstrate that updating the distribution of the simulation parameters leads to a successful policy transfer in contrast to just using an initial distribution of the parameters without any updates. As we observe, training on very wide parameter distributions is significantly more difficult and prone to fail than updating a conservative initial distribution.

Next, we show that we can successfully transfer policies to real robots, such as ABB Yumi and Franka Panda, for complex articulated tasks such as cabinet drawer opening, and tasks with non-rigid bodies and complex dynamics, such as the swing-peg-in-hole task with the peg swinging on a soft rope. The policies can be transferred with a very small number of real robot trials by leveraging large-scale training on a multi-GPU cluster.

IV-A Tasks

We evaluate our approach on two robot manipulation tasks: cabinet drawer opening and swing-peg-in-hole.

IV-A1 Swing-peg-in-hole

The goal of this task is to put a peg attached to a robot hand on a rope into a hole placed at a 45 degree angle. Manipulating a soft rope leads to a swinging motion of the peg, which makes the dynamics of the task more challenging. The task setup in simulation and the real world using a 7-DoF Yumi robot from ABB is depicted in the title figure (right). Our observation space consists of 7-DoF arm joint configurations and the 3D position of the peg. The reward function for the RL training in simulation includes the distance of the peg from the hole, angle alignment with the hole and a binary reward for solving the task.

IV-A2 Drawer opening

In the drawer opening task, the robot has to open a drawer of a cabinet by grasping and pulling it with its fingers. This task involves an ability to handle contact dynamics when grasping the drawer handle. For this task, we use a 7-DoF Panda arm from Franka Emika. Simulated and real world settings are shown in the title figure (left). This task uses a 10D observation space: 7D robot joint angles and the 3D position of the cabinet drawer handle. The reward function consists of the distance penalty between the handle and end-effector positions, the angle alignment of the end-effector and the drawer handle, the opening distance of the drawer and an indicator function ensuring that both robot fingers are on the handle.

We would like to emphasize that our method does not require the full state information of the real world, e.g. we do not need to estimate the rope diameter, rope compliance etc. to update the simulation parameter distribution in the swing-peg-in-hole task. The output of our policies consists of 7 joint velocity commands and an additional gripper command for the drawer opening task.

IV-B Simulation engine

We use NVIDIA Flex as a high-fidelity GPU based physics simulator that uses maximal coordinate representation to simulate rigid body dynamics. Flex allows a highly parallel implementation and can simulate multiple instances of the scene on a single GPU. We use the multi-GPU based RL infrastructure developed in [36] to leverage the highly parallel nature of the simulator.

IV-C Simulated experiments

Fig. 2: An example of a wide distribution of simulation parameters in the swing-peg-in-hole task where it is not possible to find a solution for many of the task instances.

We aim at understanding what effect a wide simulation parameter distribution can have on learning robust policies, and how we can improve the learning performance and the transferability of the policies using our method to adjust simulation randomization. Fig. 2 shows an example of a very wide distribution of simulation parameters for the swing-peg-in-hole task. In this case, peg size, rope properties and size of the peg box were randomized. As we can observe, a large fraction of the randomized instances have no feasible solution, e.g. when the peg is too large for the hole or the rope is too short. Finding a suitably wide parameter distribution would require manual fine-tuning of the randomization parameters.
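The effect of widening a parameter distribution on feasibility can be made concrete with a toy calculation: if peg widths are drawn from a Gaussian and the hole width is fixed, the share of samples in which the peg cannot fit grows quickly with the standard deviation. All numbers below are hypothetical and chosen only to illustrate the trend; they do not correspond to the parameters of the actual task.

```python
import numpy as np


def infeasible_fraction(peg_mean, peg_std, hole_width, n_samples, rng):
    """Fraction of sampled peg widths that do not fit the hole (toy 1-D test)."""
    peg_widths = rng.normal(peg_mean, peg_std, size=n_samples)
    return float(np.mean(peg_widths >= hole_width))


rng = np.random.default_rng(0)
# Hypothetical units: hole width 1.0, nominal peg width 0.8.
narrow = infeasible_fraction(0.8, 0.05, hole_width=1.0, n_samples=100_000, rng=rng)
wide = infeasible_fraction(0.8, 0.5, hole_width=1.0, n_samples=100_000, rng=rng)
print(f"narrow randomization: {narrow:.4f} infeasible, wide: {wide:.4f} infeasible")
```

With the narrow distribution almost every sampled scene is solvable, while with the wide one roughly a third of the scenes are infeasible in this toy setting.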

Fig. 3: Performance of the policy training with domain randomization for different variances of the distribution of the cabinet position along the X-axis in the drawer opening task.

Fig. 4: The initial distribution of the cabinet position in the source environment, located at the extreme left, gradually shifts toward the target environment distribution over 5 iterations of SimOpt.

Moreover, learning performance depends strongly on the variance of the parameter distribution. We investigate this in a simulated cabinet drawer opening task with a Franka arm placed in front of a cabinet. We randomize the position of the cabinet along the lateral direction (X-coordinate) while keeping all other simulation parameters constant. Our policies are 2-layer neural networks with fully connected layers of 64 units each, trained with PPO for 200 iterations. As we increase the variance of the cabinet position, we observe that the learned policies tend to be conservative, i.e. they do end up reaching the handle of the drawer but fail to open it. This is shown in Fig. 3, where we plot the reward as a function of the number of iterations used to train the RL policy. We start with a standard deviation of 2cm and increase it to 10cm. As shown in the plot, the policy is sensitive to the choice of this parameter and only manages to open the drawer when the standard deviation is 2cm. The reward difference may not seem that significant, but note that it is dominated by the reaching reward. Increasing the variance further, in an attempt to cover a wider operating range, can often lead to simulating unrealistic scenarios and a catastrophic breakdown of the physics simulation with various joints of the robot reaching their limits. We also observed that the policy is extremely sensitive to variance in all three axes of the cabinet position, i.e. the policy only converges when the standard deviation is 2cm and otherwise fails to learn even to reach the handle.

Fig. 5: Policy performance in the target drawer opening environment trained on randomized simulation parameters at different iterations of SimOpt. As the source environment distribution gets adjusted, the policy transfer improves until the robot can successfully solve the task in the fourth SimOpt iteration.

Fig. 6: Running policies trained in simulation at different iterations of SimOpt for real world swing-peg-in-hole and drawer opening tasks. Left: SimOpt adjusts physical parameter distribution of the soft rope, peg and the robot, which results in a successful execution of the task on a real robot after two SimOpt iterations. Right: SimOpt adjusts physical parameter distribution of the robot and the drawer. Before updating the parameters, the robot pushes too much on the drawer handle with one of its fingers, which leads to opening the gripper. After one SimOpt iteration, the robot can better control its gripper orientation, which leads to an accurate task execution.

In our next set of experiments, we perform policy transfer from the source to the target drawer opening scene, where the position of the cabinet in the target scene is offset by a distance of 15cm or 22cm. After training the policy with RL, it is run on the target scene to collect roll-outs. These roll-outs are then used to perform several SimOpt iterations to optimize simulation parameters that best explain the current roll-outs. We noticed that the RL training can be sped up by initializing the policy with the weights from the previous SimOpt iteration, effectively reducing the number of needed PPO iterations from 200 to 10 after the first SimOpt iteration. The whole process is repeated until the learned policy starts to successfully open the drawer in the target scene. We found that it took 3 iterations of RL and SimOpt overall to learn to open the drawer when the cabinet was offset by 15cm. Covering such a large distance of 15cm with naïve domain randomization would have required a standard deviation of the cabinet position of 10cm, which fails to produce a policy that opens the drawer, as shown in Fig. 3. Our method leverages roll-outs from the target scene and changes the distribution of the cabinet position such that training on this new distribution allows opening the drawer. We further note that the number of iterations increases to 5 as we increase the target cabinet distance to 22cm, highlighting that our method is able to operate on a wider range of mismatch between the current scene and the target scene. Fig. 4 shows how the source distribution variance adapts to the target distribution variance for this experiment, and Fig. 5 shows that our method starts with a conservative guess of the initial distribution of the parameters and changes it using target scene roll-outs until the policy behaviour in the target and source scenes starts to match.
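The remark that a 15cm offset calls for a standard deviation of about 10cm under plain domain randomization can be checked with a short calculation: how much probability mass does a zero-mean Gaussian over the cabinet position place near the target offset? The 2cm tolerance window around the target is an assumption made purely for illustration.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1-D Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def mass_near(target, half_width, mean, std):
    """Probability that a Gaussian sample falls within target +/- half_width."""
    return (normal_cdf(target + half_width, mean, std)
            - normal_cdf(target - half_width, mean, std))


# Source distribution centered at 0 cm, target cabinet offset of 15 cm.
p_std_2cm = mass_near(target=15.0, half_width=2.0, mean=0.0, std=2.0)
p_std_10cm = mass_near(target=15.0, half_width=2.0, mean=0.0, std=10.0)

for std_cm, p in ((2.0, p_std_2cm), (10.0, p_std_10cm)):
    print(f"std = {std_cm:4.1f} cm: target is {15.0 / std_cm:.1f} std devs away, "
          f"P(within 2 cm of target) = {p:.2e}")
```

With a 2cm standard deviation the target lies 7.5 standard deviations away and is essentially never sampled, whereas with 10cm it lies 1.5 standard deviations away and a few percent of the training scenes fall near it. That wider setting is exactly the one that fails to train in Fig. 3, which is why shifting the distribution with target roll-outs is preferable to widening it.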

IV-D Real robot experiments

In our real robot experiments, SimOpt is used to learn the simulation parameter distribution of the manipulated objects and the robot. We run our experiments on 7-DoF Franka Panda and ABB Yumi robots. The RL training and SimOpt simulation parameter sampling are performed using a cluster of 64 GPUs for running the simulator with 150 simulated agents per GPU. In the real world, we use object tracking with DART [43] to continuously track the 3D positions of the peg in the swing-peg-in-hole task and the handle of the cabinet drawer in the drawer opening task, as well as to initialize the positions of the peg box and the cabinet in simulation. DART operates on depth images and requires 3D articulated models of the objects. We learn multi-variate Gaussian distributions of the simulation parameters parameterized by a mean and a full covariance matrix, and perform several updates of the simulation parameter distribution per SimOpt iteration using the same real world roll-outs to minimize the number of real world trials.

TABLE I: Swing-peg-in-hole: simulation parameter distribution.
Rows with three values list, in order, the initial mean, the corresponding diagonal entry of the initial covariance matrix, and the final mean.

Robot properties
  Joint compliance (7D): 1.0
  Joint damping (7D): 1.0
  Joint action scaling (7D): 0.02
Rope properties
  Rope torsion compliance: 2.0, 0.07, 1.89
  Rope torsion damping: 0.1, 0.07, 0.48
  Rope bending compliance: 10.0, 0.5, 9.97
  Rope bending damping: 0.01, 0.05, 0.49
  Rope segment width: 0.004, 0.007
  Rope segment length: 0.016, 0.004, 0.017
  Rope segment friction: 0.25, 0.03, 0.29
Peg properties
  Peg scale: 0.01
  Peg friction: 1.0, 0.06, 1.0
  Peg mass coefficient: 1.0, 0.06, 1.06
Peg box properties
  Peg box scale: 0.029, 0.01, 0.034
  Peg box friction: 1.0, 0.2, 1.01

TABLE II: Drawer opening: simulation parameter distribution.
Rows with three values list, in order, the initial mean, the corresponding diagonal entry of the initial covariance matrix, and the final mean.

Robot properties
  Joint compliance (7D): 0.5
  Joint damping (7D): 0.5
  Gripper compliance: -11.0, 0.5, -10.9
  Gripper damping: 0.0, 0.5, 0.34
  Joint action scaling (7D): 0.01
Cabinet properties
  Drawer joint compliance: 7.0, 1.0, 8.3
  Drawer joint damping: 2.0, 0.5, 0.81
  Drawer handle friction: 0.001, 0.5, 2.13

IV-D1 Swing-peg-in-hole

Fig. 6 (left) demonstrates the behavior of the real robot execution of the policy trained in simulation over 3 iterations of SimOpt. At each iteration, we perform 100 iterations of RL in approximately 7 minutes and 3 roll-outs on the real robot using the currently trained policy to collect real-world observations. Then, we run 3 update steps of the simulation parameter distribution with 9600 simulation samples per update. In the beginning, the robot misses the hole due to the discrepancy of the simulation parameters and the real world. After a single SimOpt iteration, the robot is able to get much closer to the hole, however not being able to insert the peg as it requires a slight angle to go into the hole, which is non-trivial to achieve using a soft rope. Finally, after two SimOpt iterations, the policy trained on a resulting simulation parameter distribution is able to swing the peg into the hole in of the times when evaluated on 20 trials.

Table I shows the initial mean and the diagonal of the covariance matrix of the swing-peg-in-hole simulation parameters, and the shifted mean at the end of the SimOpt training. We observe that the most significant changes occur in the physical parameters of the rope that influence its dynamical behavior and the robot parameters, especially scaling of the policy actions.

IV-D2 Drawer opening

For drawer opening, we learn a Gaussian distribution of simulation parameters initialized with a mean and a diagonal covariance matrix described in Table II. Fig. 6 (right) shows the drawer opening behavior before and after performing a SimOpt update. During each SimOpt iteration, we run 200 iterations of RL for approximately 22 minutes, perform 3 real robot roll-outs and 20 update steps of the simulation distribution using 9600 samples per update step. Before updating the parameter distribution, the robot is able to reach the handle and start opening the drawer. However, it cannot exactly replicate the learned behavior from simulation and does not keep the gripper orthogonal to the drawer, which results in pushing too much on the handle from the bottom with one of the robot fingers. As the finger gripping force is limited, the fingers begin to open due to a larger pushing force. After adjusting the parameter distribution including robot and drawer properties, as shown in Table II, the robot is able to better control its gripper orientation and, when evaluated on 20 trials, opens the drawer every time while keeping the gripper orthogonal to the handle.
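The simulation and real-robot budgets quoted for the two tasks can be put side by side with a few lines of arithmetic. The figures are taken from the text (64 GPUs, 150 agents per GPU, 9600 samples per update, 3 or 20 update steps, 3 real roll-outs); reading the 9600 samples as one roll-out per simulated agent across the cluster is an observation about the numbers, not something the text states.

```python
gpus, agents_per_gpu = 64, 150
parallel_agents = gpus * agents_per_gpu   # matches the 9600 samples per update

# Per-SimOpt-iteration settings as quoted for each task.
budgets = {
    "swing-peg-in-hole": {"updates": 3, "samples_per_update": 9600, "real_rollouts": 3},
    "drawer opening": {"updates": 20, "samples_per_update": 9600, "real_rollouts": 3},
}

sim_rollouts = {
    task: b["updates"] * b["samples_per_update"] for task, b in budgets.items()
}

for task, b in budgets.items():
    ratio = sim_rollouts[task] // b["real_rollouts"]
    print(f"{task}: {sim_rollouts[task]} simulated roll-outs per SimOpt iteration, "
          f"{b['real_rollouts']} real roll-outs, {ratio} simulated per real")
```

Each SimOpt iteration therefore pairs 3 real roll-outs with 28800 simulated roll-outs for swing-peg-in-hole and 192000 for drawer opening.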

IV-E Comparison to trajectory-based parameter learning

In our work, we run a closed-loop policy in simulation to obtain simulated roll-outs for SimOpt optimization. Alternatively, we could directly set the simulator to states and execute actions from the real world trajectories as proposed in [29, 30]. However, such a setting is not always possible, as we might not be able to observe all required variables for setting the internal state of the simulator at each time point, e.g. the current bending configuration of the rope in the swing-peg-in-hole task, which we are able to initialize but cannot continually track with our real world setup.

Without being able to set the simulator to the real world states continuously, we can still try to copy the real world actions and execute them in an open-loop manner in simulation. However, in our simulated experiments we notice that, especially when making particular state dimensions unobservable for the SimOpt cost computation, such as the X-position of the cabinet in the drawer opening task, executing a closed-loop policy still leads to meaningful simulation parameter updates, in contrast to open-loop execution. We believe that in this case the robot behavior still depends on the particular simulated scenario due to the closed-loop nature of the policy, which is also reflected in the joint trajectories of the robot that are still included in the SimOpt cost function. This means that by using a closed-loop policy we can still update the simulation parameter distribution even without explicitly including some of the relevant observations in the SimOpt cost computation.
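A one-dimensional toy makes this argument concrete. Below, a proportional controller drives a single "joint" toward the cabinet position, and the cost compares joint trajectories only, so the cabinet position itself is hidden from the cost. The controller, gain and positions are hypothetical and only illustrate the mechanism; they are not the policy or task from the experiments.

```python
import numpy as np


def closed_loop_rollout(cabinet_x, gain=0.2, steps=30):
    """Policy reacts to the scene: the joint trajectory depends on cabinet_x."""
    joint, joints, actions = 0.0, [], []
    for _ in range(steps):
        action = gain * (cabinet_x - joint)
        joint += action
        joints.append(joint)
        actions.append(action)
    return np.array(joints), np.array(actions)


def open_loop_rollout(recorded_actions, cabinet_x):
    """Replay recorded actions; cabinet_x is deliberately unused."""
    return np.cumsum(recorded_actions)


def joint_cost(sim_joints, real_joints):
    """Cost on joint observations only (cabinet position is not observed)."""
    return float(np.sum((sim_joints - real_joints) ** 2))


# "Real" roll-out with the cabinet at a hidden position of 0.15.
real_joints, real_actions = closed_loop_rollout(cabinet_x=0.15)

candidates = [0.0, 0.15, 0.30]   # simulated cabinet positions to score
closed_costs = [
    joint_cost(closed_loop_rollout(c)[0], real_joints) for c in candidates
]
open_costs = [
    joint_cost(open_loop_rollout(real_actions, c), real_joints) for c in candidates
]
```

In this toy the closed-loop costs single out the true cabinet position even though it never appears in the cost, while the open-loop costs are identical for every candidate and carry no information about it.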

V Conclusions

Closing the simulation to reality transfer loop is an important component for a robust transfer of robotic policies. In this work, we demonstrated that adapting simulation randomization using real world data can help in learning simulation parameter distributions that are particularly suited for a successful policy transfer without the need for exact replication of the real world environment. In contrast to trying to learn policies using very wide distributions of simulation parameters, which can simulate infeasible scenarios, we are able to start with distributions that can be efficiently learned with reinforcement learning, and modify them for a better transfer to the real scenario. Our framework does not require full state of the real environment and reward functions are only needed in simulation. We showed that updating simulation distributions is possible using partial observations of the real world while the full state still can be used for the reward computation in simulation. We evaluated our approach on two real world robotic tasks and showed that policies can be transferred with only a few iterations of simulation updates using a small number of real robot trials.

In this work, we applied our method to learning uni-modal simulation parameter distributions. We plan to extend our framework to multi-modal distributions and more complex generative simulation models in future work. Furthermore, we plan to incorporate higher-dimensional sensor modalities, such as vision and touch, for both policy observations and factors of simulation randomization.

Acknowledgements

We would like to thank Alexander Lambert, Balakumar Sundaralingam and Giovanni Sutanto for their help with the robot experiments, and David Ha, James Davidson, Lerrel Pinto and Fabio Ramos for their helpful feedback on the draft of the paper. We would also like to thank the GPU cluster and infrastructure team at NVIDIA for their help all the way through this project.

References

VI Appendix

Tables III and IV show the SimOpt distribution update parameters for the swing-peg-in-hole and drawer opening tasks, including REPS [37] parameters, settings of the discrepancy function $D$, weights of each observation dimension in the discrepancy function, and reinforcement learning settings such as parallelized PPO [35, 36] training parameters and task reward weights.

TABLE III: Swing-peg-in-hole: SimOpt parameters.

Simulation distribution update parameters
  Number of REPS updates per SimOpt iteration: 3
  Number of simulation parameter samples per update: 9600
  Timesteps per simulation parameter sample: 453
  KL-threshold: 1.0
  Minimum temperature of sample weights: 0.001
Discrepancy function parameters
  L1-cost weight: 0.5
  L2-cost weight: 1.0
  Gaussian smoothing standard deviation (timesteps): 5
  Gaussian smoothing truncation (timesteps): 4
Observation dimensions cost weights
  Joint angles (7D): 0.05
  Peg position (3D): 1.0
  Peg position in the previous timestep (3D): 1.0
PPO parameters
  Number of agents: 100
  Episode length: 150
  Timesteps per batch: 64
  Clip parameter: 0.2
  $\gamma$: 0.99
  $\lambda$: 0.95
  Entropy coefficient: 0.0
  Optimization epochs: 10
  Optimization batch size per agent: 8
  Optimization step size:
  Desired KL-step: 0.01
RL reward weights
  L1-distance between the peg and the hole: -10.0
  L2-distance between the peg and the hole: -4.0
  Task solved (peg completely in the hole) bonus: 0.1
  Action penalty: -0.7

TABLE IV: Drawer opening: SimOpt parameters.

Simulation distribution update parameters
  Number of REPS updates per SimOpt iteration: 20
  Number of simulation parameter samples per update: 9600
  Timesteps per simulation parameter sample: 453
  KL-threshold: 1.0
  Minimum temperature of sample weights: 0.001
Discrepancy function parameters
  L1-cost weight: 0.5
  L2-cost weight: 1.0
  Gaussian smoothing standard deviation (timesteps): 5
  Gaussian smoothing truncation (timesteps): 4
Observation dimensions cost weights
  Joint angles (7D): 0.5
  Drawer position (3D): 1.0
PPO parameters
  Number of agents: 400
  Episode length: 150
  Timesteps per batch: 151
  Clip parameter: 0.2
  $\gamma$: 0.99
  $\lambda$: 0.95
  Entropy coefficient: 0.0
  Optimization epochs: 5
  Optimization batch size per agent: 8
  Optimization step size:
  Desired KL-step: 0.01
RL reward weights
  L2-distance between end-effector and drawer handle: -0.5
  Angular alignment of end-effector with drawer handle: -0.07
  Opening distance of the drawer: -0.4
  Keeping fingers around the drawer handle bonus: 0.005
  Action penalty: -0.005