Through cloud-based robotic systems, Robotics-as-a-Service (RaaS) promises to alleviate many of the upfront requirements to install and maintain robot hardware [kehoe2015cloudrobsurvey]. The advantages of this model include easier management and scalability, greater flexibility, and savings in terms of compute power and utilities. While RaaS has been proposed for parallel grid computing on demand, collective robot learning, and crowd-sourcing access to remote human expertise, a great need for such a model has emerged in the robotics community during the current pandemic. In this work, we show how large-scale simulation done on a desktop-grade GPU and cloud-based robotics can enable roboticists to perform research in robotic learning with modest resources. We focus on 6-DoF object manipulation by using a dexterous multi-finger manipulator as a case study.
Dexterous manipulation requires dealing with the high dimensionality of the system, hybrid dynamics, and uncertainties about the environment [okamura2000overview]. Prior work has trained a control policy for in-hand manipulation of a block with a Shadow Dexterous Hand [openai-sh]. This was achieved through scientific compute clusters for distributed training in simulation and access to specialized hardware (the “cage”) to receive reliable state information during object interaction in the real world. While impressive, the requirement for such exorbitant infrastructure makes this kind of study difficult to reproduce and borderline impractical, which explains the paucity of results building upon this work for further research into learning manipulation.
The authors of [trifinger-platform] designed an open-source, low-cost robotic platform for dexterous manipulation called TriFinger. They showed that the robot is suitable for deploying learned policies due to its hardware robustness and software safety checks. Building upon this work, the authors organized the ‘Real Robot Challenge (RRC)’ [real-robot-challenge], for which they developed a farm of TriFinger systems. For the challenge, the authors provided access to a PyBullet simulation of the robot [coumans2013bullet]. This challenge reinforced the inaccessibility of applying learning-based systems for such tasks: most competing teams elected to use structured policies with manually tuned high-level controllers or residual learning on top [trifinger-benchmarking, rrc-submission-chen, rrc-submission-yoneda].
One of the reasons for the above is the ubiquity of CPU-based simulators [pybullet, MuJoCo]. Because of its low sample generation rate, PyBullet (a CPU-based simulator) makes tuning and training a successful policy for such a complex task highly time-consuming. In addition, the TriFinger platform is based on low-cost open-source hardware [trifinger-platform]. The TriFinger system uses a vision-based tracker to triangulate the object pose. The tracker functions at a low frequency and provides noisy estimates of the cube pose, making reliable policy inference difficult. Lastly, working across the Atlantic on the cloud-based platform, with limited access to the hardware, slows down the iteration cycles. While the overall cost of the system is reduced when compared to a high-end setup like the Shadow Hand used by OpenAI, more noise and delays are present due to the commodity nature of the hardware. This makes sim-to-real transfer non-trivial.
This effort aims to overcome these limitations through a systems approach to robotics with infrastructure around a GPU-accelerated simulator coupled with a remote TriFinger system for successful sim-to-real transfer of a learned policy, as shown in Figure 1. Using NVIDIA’s IsaacGym [pmlr-v87-liang18a], we train a policy to learn 6-DoF object manipulation successfully in under a day on a single desktop-grade GPU. This number is in contrast to previous efforts, for example, OpenAI’s distributed training infrastructure, which took several days to learn a robust policy for cube rotation on a large distributed server involving hordes of CPU and GPU clusters. Additionally, we investigate different object pose representations for observations and rewards formulation. To allow successful sim-to-real transfer, we perform domain randomization on various physics properties in the simulator along with different noise models for observations and actions.
This paper primarily makes robot systems contributions as follows:
We provide a framework for learning similar in-hand manipulation tasks with sim-to-real transfer using far fewer computational resources (1 GPU & CPU) than prior work but that can also benefit from large scale training.
We show the benefits of using keypoints as representations of object pose and in reward computation with RL algorithms for in-hand manipulation, especially when re-posing an object in all 6 degrees of freedom.
We demonstrate the ability to learn a challenging 6-DoF manipulation task, re-posing a cube, with simulation data alone and deploying on a third-party remote physical robot setup.
We open-source the software platform to run training in simulation and inference of the resulting policies for other researchers to build on top of.
2 Related Work
Advances in reinforcement learning (RL) algorithms and computational hardware have enabled rapid progress in using these algorithms for tasks on real robots. Techniques such as domain randomization and large-scale training have enabled results across a variety of tasks with sim2real, including in-hand manipulation [openai-sh, openai-rubiks], as well as in legged locomotion [Hwangbo_2019, shi2020circus]. Active identification of system parameters has also been shown to be helpful in the context of learning manipulation tasks [chebotar2019closing]. However, few results have focused on learning-based control over a full 6-DoF pose in-hand. Furthermore, none of the existing systems for training in-hand manipulation at a large scale have been provided in an accessible manner for further work in robot learning to build on.
The most widely used simulators for robot learning research are MuJoCo [MuJoCo] and PyBullet [pybullet]. While both have proven successful for various robotic locomotion and manipulation tasks, they are often slow for complicated environments and require CPU clusters, which limits their scalability. Brax [brax2021github], on the other hand, supports GPU / TPU acceleration, but it comes at the cost of simplified physics simulation assumptions and simple environments. IsaacGym [pmlr-v87-liang18a] offers high-fidelity physics modelling and GPU acceleration support. It also supports directly sharing observations and actions through GPU memory between the policy network and the physics engine, massively increasing throughput. Part of our contribution is to demonstrate the viability of GPU-based simulation for sim2real in-hand manipulation.
3.1 Simulation Environment
We train on the IsaacGym simulator [pmlr-v87-liang18a], a simulation environment tailored towards allowing policy learning with a high sampling rate (>50K samples/sec in policy inference on a Tesla V100 and around 100K samples/sec on an RTX 3090) on a single GPU. This is crucial to our goal of providing an accessible yet generalisable framework for sim2real with in-hand manipulation. We simulate the physical system with a timestep of 0.02 s (50 Hz, matching the policy rate described in Sec 3.7), which we found gave a good balance between simulation fidelity, speed of learning, and ability to learn the high-frequency motions required in such a manipulation task.
3.2 Representation of the Cube Pose
Our task involves manipulating an object in 6 degrees of freedom. As such, we must represent this pose in multiple stages of our training pipeline. Prior work has shown the benefits of alternate representations of spatial rotation when using neural networks [zhou2020continuity]. We choose to represent the pose of the object using 8 keypoints sampled in the object’s local space, one at each vertex of the cube. The locations of these keypoints in the object’s local frame are constant. We compute two sets of keypoints in the world frame: one for the current pose of the object and one for the goal pose. These are obtained by a straightforward transformation of the local keypoints into the world frame using the current / goal poses of the object. Each pose thus yields a set of 8 keypoints in 3-D Euclidean space. When concatenated for policy inference, the set of 8 keypoints yields a 24-D vector representing an object pose. In Sec. 4.2, we contrast this representation to the position+quaternion formulation used in [openai-sh, pmlr-v87-liang18a], finding that it improves the policy’s success rate. We note that as long as we are able to get the pose of the object (e.g. via [tremblay2018corl:dope]), we are able to obtain the 8 keypoints of the bounding box, and that this does not depend on the object morphology (as shown in Appendix A.4).
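The transformation from a pose to the 24-D keypoint vector can be sketched as follows. This is a minimal illustration rather than the training code: the function names, the `(x, y, z, w)` quaternion ordering, and the ordering of the eight vertices are assumptions, while the 6.5 cm side length is taken from Appendix A.4.

```python
import itertools

CUBE_SIDE = 0.065  # metres; 6.5 cm cube, as stated in Appendix A.4

# Eight keypoints at the cube vertices, constant in the object's local frame.
# The vertex ordering is an arbitrary choice made for this sketch.
LOCAL_KEYPOINTS = [
    tuple(s * CUBE_SIDE / 2 for s in signs)
    for signs in itertools.product((-1.0, 1.0), repeat=3)
]


def rotate(q, v):
    """Rotate 3-D vector v by unit quaternion q = (x, y, z, w) (assumed ordering)."""
    x, y, z, w = q
    vx, vy, vz = v
    # t = 2 * (q_xyz cross v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + (q_xyz cross t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def pose_to_keypoints(position, quaternion):
    """Return the flat 24-D world-frame keypoint vector for a given pose."""
    flat = []
    for kp in LOCAL_KEYPOINTS:
        rotated = rotate(quaternion, kp)
        flat.extend(rotated[i] + position[i] for i in range(3))
    return flat
```

Because the rotation depends on the quaternion only through products of pairs of its components, `pose_to_keypoints(p, q)` and `pose_to_keypoints(p, -q)` return the same vector, which is the invariance referred to in the pose filtering discussion of the appendix.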
3.3 Observation and Action Spaces
The observations of the policy (actor) and value-function (critic) are described in Table 0(a) and 0(b). We use the representation of cube pose described in the previous Sec 3.2. The action space of our policy is torque on each of the 9 joints of the robot. The torque on each joint is limited such that it does not damage the equipment while in operation. A safety damping is applied to the output torques in simulation to mimic those on the real-world robot.
3.4 Reward Formulation & Curriculum
Our reward has three components, each of which we found to be helpful in achieving good training and sim-to-real performance. Following [Hwangbo_2019], we use a logistic kernel to convert tracking error in Euclidean space into a bounded reward function. We slightly generalise the given formulation to account for a range of distance scales, defining $\mathcal{K}(x) = (e^{ax} + b + e^{-ax})^{-1}$, where $a$ is a scaling factor and $b$ controls the sensitivity of the kernel at low values of distance. We use fixed values for $a$ and $b$.
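As noted in Sec 3.2, we use keypoints in order to calculate the reward. The component of the reward corresponding to the distance between the cube’s current pose and the desired target pose is given by $\text{keypoints}(t) = \sum_{i=1}^{8} \mathcal{K}(\lVert k_i^{curr} - k_i^{goal} \rVert_2)$, where $\mathcal{K}$ is the logistic kernel and $k_i^{curr}$ and $k_i^{goal}$ are each of the keypoints at the corners of the current and target cubes, respectively.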
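A minimal sketch of the keypoint reward follows. It assumes the generalised logistic kernel has the form $1 / (e^{ax} + b + e^{-ax})$ and that the per-keypoint terms are summed; the function names are hypothetical, and any values passed for `a` and `b` are placeholders chosen by the caller, not the values used in our experiments.

```python
import math


def logistic_kernel(x, a, b):
    """Bounded kernel 1 / (exp(a*x) + b + exp(-a*x)); largest at x = 0."""
    return 1.0 / (math.exp(a * x) + b + math.exp(-a * x))


def keypoint_reward(current_kps, goal_kps, a, b):
    """Sum the kernel over the Euclidean distance of each keypoint pair.

    current_kps and goal_kps are sequences of eight (x, y, z) points.
    """
    return sum(
        logistic_kernel(math.dist(c, g), a, b)
        for c, g in zip(current_kps, goal_kps)
    )
```

With this form, the reward is bounded above by `8 / (2 + b)`, reached when every keypoint coincides with its goal, and decays smoothly as the keypoints move apart, so position and orientation errors are both expressed as distances in Euclidean space.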
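In order to encourage the fingers to reach the cube during initial exploration, we give a reward for moving the fingers towards the cube, which was also found to be helpful in [causalworld]. This term is defined by the sum of the movement of each fingertip towards the object per timestep: $\text{fingertip\_to\_object}(t) = \sum_{i=1}^{3} \Delta_t^i$, where $\Delta_t^i$ denotes the change across the timestep of the fingertip distance to the centroid of the object, $\Delta_t^i = \lVert \mathbf{ft}_{i,t} - \mathbf{p}_{t} \rVert_2 - \lVert \mathbf{ft}_{i,t-1} - \mathbf{p}_{t-1} \rVert_2$, $\mathbf{p}$ denotes the position of the object centroid, and $\mathbf{ft}_i$ denotes the position of the $i$-th fingertip.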
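Finally, we define a penalty on the movement of each finger, preventing sudden motions that may be difficult to execute reliably on the real robot: $\text{fingertip\_velocity\_penalty}(t) = \sum_{i=1}^{3} \lVert \dot{\mathbf{ft}}_i \rVert_2^2$, where $\dot{\mathbf{ft}}_i$ is the velocity of the $i$-th fingertip.
Our total reward is defined as: $r(t) = w_{fv} \cdot \text{fingertip\_velocity\_penalty}(t) + w_{fo} \cdot \text{fingertip\_to\_object}(t) + w_{kp} \cdot \text{keypoints}(t)$,
where $w_{fv}$, $w_{fo}$ and $w_{kp}$ are the weights of the finger movement penalty, the fingertip-to-object reward and the keypoint reward, respectively. We also found in initial experimentation that a curriculum reducing the weight of the fingertip_to_object reward partway through training was needed in order to allow the robot to perform nonprehensile manipulation of the cube (releasing it in order to reorient). However, having the reward term during the initial phases of training dramatically sped up learning by encouraging the robot to interact with the cube.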
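The weighted sum and the curriculum on the fingertip_to_object weight can be sketched as below. This is an illustration only: the configuration keys, the step-function form of the curriculum, and the sign convention (the fingertip term is taken to be the change in distance, so it is negative when the fingers approach the cube) are assumptions, and all numeric values are left to the caller.

```python
def fingertip_to_object_weight(step, initial_weight, final_weight, switch_step):
    """Curriculum: use initial_weight before switch_step and final_weight after."""
    return initial_weight if step < switch_step else final_weight


def total_reward(kp_reward, fingertip_delta_sum, velocity_penalty, step, cfg):
    """Weighted sum of the three reward components at a given training step.

    cfg is a hypothetical dict with keys w_kp, w_fv, w_fo_initial,
    w_fo_final and switch_step.
    """
    w_fo = fingertip_to_object_weight(
        step, cfg["w_fo_initial"], cfg["w_fo_final"], cfg["switch_step"]
    )
    return (
        cfg["w_kp"] * kp_reward
        + w_fo * fingertip_delta_sum
        + cfg["w_fv"] * velocity_penalty
    )
```

Keeping the curriculum in a separate function makes it easy to swap the step schedule for a gradual decay if the abrupt change in reward scale proves disruptive.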
3.5 Domain Randomization
Domain Randomization (DR) is a straightforward yet practical technique for improving the robustness of policies in sim2real transfer [openai-dr, Jason:ICRA:2018, mandlekar2017arpl]. We choose our Domain Randomization parameters to account for modelling errors in the environment as well as noise in sensor measurements. These parameters are listed in Table 2. In addition to these randomizations, we apply random forces to the cube in the same manner as described in [openai-sh] in order to improve the stability of grasps and represent unmodelled dynamics. We mimic the dynamics of the camera on the real system, described in Sec 3.7, by repeating the observation of the keypoints for 5 frames.
In Table 2, the observation and action noise parameters are the standard deviations of additive Gaussian noise sampled every timestep and at the start of each episode, respectively. For the environment, the parameters represent scaling factors applied to the nominal values in the real robot model.
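The two noise levels can be sketched as an additive model with one offset drawn at the start of each episode and fresh noise drawn at every timestep. The class and parameter names are hypothetical, and the standard deviations are supplied by the caller rather than taken from Table 2.

```python
import random


class TwoLevelGaussianNoise:
    """Additive Gaussian noise with a per-episode offset and per-timestep jitter."""

    def __init__(self, sigma_step, sigma_episode, dim, seed=None):
        self.sigma_step = sigma_step
        self.sigma_episode = sigma_episode
        self.dim = dim
        self.rng = random.Random(seed)
        self.offset = [0.0] * dim

    def reset(self):
        """Resample the offset that is held constant for the whole episode."""
        self.offset = [
            self.rng.gauss(0.0, self.sigma_episode) for _ in range(self.dim)
        ]

    def apply(self, values):
        """Return values plus the episode offset plus fresh per-timestep noise."""
        return [
            v + o + self.rng.gauss(0.0, self.sigma_step)
            for v, o in zip(values, self.offset)
        ]
```

In use, `reset()` would be called whenever an environment resets and `apply()` on every observation or action vector; the per-episode offset models a systematic bias such as a calibration error, while the per-timestep term models sensor jitter.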
3.6 Policy Architecture & Learning
We train using the Proximal Policy Optimization algorithm [schulman2017proximal], using the implementation from [rl-games], which vectorizes observations and actions on GPU, allowing us to take advantage of the parallelization provided by the simulator (see Sec 3.1). We use a fixed discount factor $\gamma$ and a clipping range of $\epsilon = 0.2$. The learning rate is annealed linearly over the course of training from 5e-4 to 1e-6; detailed hyperparameters are described in Appendix A.1. The inputs to the policy are described in Table 0(b). We use an asymmetric actor-critic approach [asymmetric-ac]. The actor has 4 hidden layers, 2 of size 256 followed by 2 of size 128, and 9 outputs which are scaled to the torque ranges of the real robot. The critic has 2 layers of size 512, followed by one layer each of size 256 and 128, and produces a scalar value function as output.
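The linear annealing of the learning rate can be written in a few lines. The start and end values are those listed in Appendix A.1; the function name and the choice to index the schedule by environment step (rather than by epoch) are assumptions made for this sketch.

```python
def linear_lr(step, total_steps, lr_start=5e-4, lr_end=1e-6):
    """Linearly interpolate the learning rate from lr_start to lr_end."""
    # Clamp progress to [0, 1] so steps beyond total_steps stay at lr_end.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)
```

For example, halfway through training this schedule gives a learning rate of roughly 2.5e-4.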
3.7 Policy Inference on Remote Real Robot
We evaluate our policy remotely on the TriFinger system [trifinger-platform] provided by the organisers of the real robot challenge [trifinger-benchmarking]. The cube is tracked on the system using 3 cameras, described in [trifinger-object-tracking]. We convert the position+quaternion representation output by this system into the keypoints representation described in 3.2 and use it as input to the policy. Observations of the cube pose from the camera system are provided at 10Hz. Proprioceptive measurements are available at up to 1KHz. Our policy is evaluated at 50Hz, matching the simulation timestep. We repeat the camera-based cube-pose observations for subsequent rounds of policy evaluation to allow the policy to take advantage of the higher-frequency proprioceptive data available to the robot. The resulting observations are identical to what we use in simulation (Table 0(a)).
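The repetition of camera observations can be sketched as follows: with the camera at 10 Hz and the policy at 50 Hz, each camera frame is reused for 5 consecutive policy steps while the proprioceptive reading is refreshed at every step. The function and variable names are hypothetical.

```python
CAMERA_HZ = 10
POLICY_HZ = 50
STEPS_PER_FRAME = POLICY_HZ // CAMERA_HZ  # 5 policy steps per camera frame


def build_observations(camera_frames, proprio_readings):
    """Pair each proprioceptive reading with the most recent camera frame."""
    observations = []
    for step, proprio in enumerate(proprio_readings):
        # Hold the last available frame if the camera sequence runs out.
        frame_idx = min(step // STEPS_PER_FRAME, len(camera_frames) - 1)
        observations.append((camera_frames[frame_idx], proprio))
    return observations
```

This mirrors the 5-frame repetition of keypoint observations applied in simulation (Sec 3.5), so the policy sees the same update pattern in both settings.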
In our experiments, we aim to answer the following four questions pertaining to learning a robust policy for this task, as well as evaluating how well it transfers to the real world:
How well does our system using large-scale simulation train on this task with a reward function similar to what has been previously proposed?
How does training performance change when we use a different representation - keypoints - for reward computation and policy input?
Is our policy robust to sensor noise and varying environment parameters, and is robustness improved by use of Domain Randomization during training?
How well do our policies, trained entirely in simulation, transfer to the real TriFinger system?
4.1 Experiment 1: Training
The aim in our 6-DoF manipulation task is to get the position and orientation of the cube to a specified goal position and orientation. We define our metric for ‘success’ in this task as getting the position within 2 cm and the orientation within 22° of the target goal pose, as used in [openai-sh]; this is comparable to mean results obtained in [trifinger-benchmarking]. Following previous works dealing with similar tasks [openai-sh, openai-rubiks, causalworld], we attempted applying a reward based on the position and orientation components of error individually.
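The success criterion can be checked as in the sketch below, under the assumption that the orientation error is measured as the smallest rotation angle between the current and goal unit quaternions; the function names are hypothetical.

```python
import math

POSITION_TOL = 0.02                 # 2 cm, in metres
ORIENTATION_TOL = math.radians(22)  # 22 degrees, in radians


def rotation_angle(q_a, q_b):
    """Smallest rotation angle in radians between two unit quaternions."""
    # The absolute value makes the result identical for q and -q.
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))
    return 2.0 * math.acos(min(dot, 1.0))


def is_success(pos, quat, goal_pos, goal_quat):
    """True if both the position and the orientation are within tolerance."""
    pos_ok = math.dist(pos, goal_pos) <= POSITION_TOL
    rot_ok = rotation_angle(quat, goal_quat) <= ORIENTATION_TOL
    return pos_ok and rot_ok
```

As noted in Appendix A.3.1, we apply this check at the end of an episode rather than at any point during it.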
We spent considerable effort experimenting with a variety of kernels and tuning the parameters of the translation / rotation based reward. The best candidate reward of this format was the sum of a position term and an orientation term. The position term is the logistic kernel applied to the L2 norm between the current and target cube positions, and the orientation term is a function of the distance in radians between the current and target cube orientations. We use an alternative scaling parameter in the logistic kernel, which we found to work better in this reward formulation (see Sec. 3.4). We use the same weightings for each of the 3 components of the reward as in Sec 3.4.
The results are shown in Figure 4. We found that while this formulation of the reward allowed PPO to learn a policy that gets the cube to the goal position, even after 1 billion steps in an environment with no Domain Randomization it learned only very slowly to achieve the orientation goal.
4.2 Experiment 2: Representation of Pose
The poor results in Experiment 1 (Sec 4.1) led us to search for alternative representations of cube pose in the calculation of the reward and policy observations; these are described in Sec 3.2 & 3.4. We compared keypoints against positions and quaternions as representations of the object pose along two axes: first, as the policy input, and second, in the reward calculation, where the keypoint reward is compared to a reward based on the linear and angular distances individually.
For the observations, in order to provide a fair comparison between position/quaternion and keypoints as policy input, we ensured that we applied observation noise and delays in the same manner (by applying them in the position and quaternion space before transforming to keypoints, as noted in Sec. 3.5). Note also that both representations are computed from the spatial pose of the cube alone. We represent the pose of the cube with a 7-dim vector consisting of a 3-D translation and a 4-D quaternion. We provide the position and quaternion of the goal pose as input to the actor and critic, replacing the keypoints in Tables 0(a) & 0(b).
For the reward, in order to provide a fair comparison to the keypoints reward, as mentioned previously we spent many hours tuning the kernels and parameters used in the translation based reward, described in Experiment 1. In comparison, little effort was spent tuning the keypoints function, with only one tweak to the weightings in the logistic kernel, showing the relative simplicity of working with this formulation.
Figure 5 shows the results of training, against both timesteps and wall-clock time. In the curve without any Domain Randomization, we trained for 1 billion steps over the course of 6 hours on a single GPU. Using keypoints in observations and the reward function performs the best of the four policies, also exhibiting a low variance among seeds.
When Domain Randomization is applied, the two curves with a keypoints-based reward are far better in terms of success rate at the end of training and in terms of convergence rate; however, in this case having observations be keypoints seems to matter somewhat less. This is perhaps due to the longer training (4B steps & 24 hours on a single GPU) overwhelming the inductive bias introduced by using keypoints as representations. However, using keypoints to compute the reward provided a large benefit in both cases, showing the improvement caused by calculating the reward in Euclidean space rather than mixing linear and angular displacements through addition.
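As a quick arithmetic check, the step counts and wall-clock times quoted above imply the same average training throughput in both settings. The helper below is only a worked calculation on those quoted figures; the function name is hypothetical.

```python
def steps_per_second(total_steps, hours):
    """Average environment steps per second over a training run."""
    return total_steps / (hours * 3600)


no_dr_rate = steps_per_second(1e9, 6)     # 1B steps in 6 hours, without DR
with_dr_rate = steps_per_second(4e9, 24)  # 4B steps in 24 hours, with DR
```

Both runs work out to roughly 46K steps per second, which includes policy updates and is therefore somewhat below the inference-only sampling rate quoted in Sec 3.1.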
4.3 Experiment 3: Robustness of Policies in Simulation
In order to investigate the impact that Domain Randomization (DR, see Sec 3.5) has on the robustness of policies of a hand in this configuration, we ran experiments by varying parameters outside of the normal domain randomization ranges in simulation. Figure 6 shows the results. We find that, despite only being randomised initially within a range of 0.97-1.03x nominal size, our policies with Domain Randomization achieve over an 80% success rate even with a scale of 0.6 and 1.2x nominal size, while those without DR have a success rate that drops off much more quickly outside the normal range. We find similar results when scaling the object mass relative to the nominal range; however, in this case we find that the policies using a keypoints-based reward, even without DR, are much more robust at masses 3x nominal.
4.4 Experiment 4: Simulation to Remote Real Robot Transfer
We ran experiments on the real robot to determine the success rate of the policies trained with Domain Randomization under the metric defined in Sec 4.1. We performed multiple trials for each policy; the results for each of the four ablations on keypoints which we tested are shown in Figure 7.
Out of the four models discussed in 4.2, the best policy achieved a success rate of 82.5%. This was achieved with keypoints used both in the observations of the policy and in the reward function during training (O-KP+R-KP). The policy using position+quaternion observations but with a reward calculated with keypoints (O-PQ+R-KP) achieved a 77.5% success rate. These first two policies were well within each other’s confidence intervals. This is likely due to the impact of the better representation of keypoints being mitigated somewhat after 4 billion steps of training, as discussed in 4.2. In contrast, neither of the policies trained using the position & quaternion based reward achieved good success rates, with the policy using keypoints-based observations (O-KP+R-PQ) achieving only a 60% success rate while the one with position and quaternion observations (O-PQ+R-PQ) only achieved a 55% success rate. These results show the importance of having a reward function which effectively balances learning to achieve the goal in position and in orientation, in order to have policies with a high success rate in simulation and thus a high corresponding success rate after real robot transfer.
We noticed a variety of emergent behaviours used to achieve sub-goals within the overall cube-reposing task. We display some of these in the panel in Figures 1 and 8. The most prominent of these is "dropping and regrasping". In this maneuver, the robot learns to drop the cube when it is close to the correct position, re-grasp, and pick it back up. This enables the robot to get a stable grasp on the cube in the right position. The robot also learns to use the motion of the cube to the correct location in the arena as an opportunity to simultaneously rotate it on the ground, which makes achieving the correct grasp easier in challenging target locations far from the center of the fingers’ workspace. Our policy is also robust towards dropping: it is able to recover from a cube falling out of the hand and retrieve it from the ground.
This paper emphasizes the empirical value of a systems approach to robot learning through a case study in dexterous manipulation. We introduced a framework for learning in-hand manipulation tasks and transferring the resulting policies to the real world. Using GPU-based simulation, we showed how this can be done with an order of magnitude fewer computational resources than prior work. Furthermore, we showed how RL algorithms for in-hand manipulation can benefit from using keypoints as opposed to the more common angular and linear displacement-based reward and observation computation. This paper shows a path for democratization of robot learning and a viable solution through large-scale simulation and robotics-as-a-service.
Appendix A
a.1.1 Learning Algorithm Details
|GAE Discount Factor||0.95|
|Learning Rate (start of training)||5e-4|
|Learning Rate (end of training, linear decay)||1e-6|
Number of Epochs
|Clip Range||0.2|
We used the open-source version of PPO from [rl-games] which provides the ability to work with highly vectorised environments. The hyperparameters used are listed in Table 3.
a.2 Success for rotation and position
We break out position and rotation success rates individually in Figure 9. They show that the keypoint-based reward formulation fixes the issues identified in Experiment 1 from the paper, namely that summing position and orientation components of reward leads to poor orientation success rate. Using keypoints improves orientation performance without sacrificing achieving the position goal. It is still apparent that progress can be made with reducing this gap as the orientation reward still continues improving until 4 Billion steps of experience, and this is a direction of ongoing work.
a.2.1 Reproducing results on a consumer GPU
The reward curves and times stated in the main text were produced on a single NVIDIA V100 GPU. We were able to reproduce these results on a desktop machine with a consumer-grade NVIDIA RTX 3090 GPU. This produced the same reward curves and reduced the training time from around 24 hours to 20 hours, showing the ability of our system to train on a desktop.
a.3 Details of Sim2Real Transfer
a.3.1 Success Thresholds
The success rates on the real robot for different thresholds of position and orientation are shown in Figure 10. We see a graceful degradation as the success thresholds are tightened. We note that these are necessarily based on noisy camera observations due to the remote nature of the setup; at 0.01 m of position and 0.1 rad of orientation error this becomes a particular problem. Note also that ‘success’ for us is based on a different metric than in some other works (e.g. [openai-sh]): we define ‘success’ as being within the goal thresholds at the end of an episode instead of achieving it at any point during it. This is because part of the challenge of the TriFinger orientation task is being able to grasp and hold the cube in position, which the upside-down orientation of the TriFinger makes challenging.
a.3.2 Hardware setup
As mentioned in the main text, we perform inference on the Trifinger platform remotely. The interface is described in the corresponding whitepaper [trifinger-platform].
Inference, including camera tracking and running the network, is performed on CPU on the same computer that was used for the hand-written solutions to last year’s Real Robot Challenge [trifinger-benchmarking, rrc-submission-chen, rrc-submission-yoneda]. An entire setup to run our system, including training, inference and physical robot hardware, could be purchased for less than US$10,000.
a.3.3 Software details
Inference is done in Python; the time from getting the observations to sending the actions to the hardware platform is on the order of 5-8 ms, a delay consisting of generating the keypoint observations and running the policy. Reducing this delay by moving our inference code to C++ is a direction for future improvements to our system.
a.3.4 Pose Filtering
Unlike some previous works using visual information to perform in-hand manipulation, our system uses the pose estimator provided in [trifinger-object-tracking]. This performs iterative optimization without reference to the history, and thus can provide temporally inconsistent quaternion inputs to the policy, with the quaternion value flipping between $q$ and $-q$. We found that this destabilised the policies which were provided position and quaternion inputs during inference, and so implemented a simple filter over the input: if the quaternion from the last camera measurement, $q_{t-1}$, was within 0.2 of the negated quaternion from a new camera measurement, $-q_t$, we used $-q_t$ in the policy input. While this had no impact on the keypoints model (it performs an analytic transformation prior to policy inference which is invariant to this issue), we found it important to perform this transformation to allow stable grasps in policies which took raw quaternions as input and thus to provide a fair comparison.
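The sign filter described above can be sketched in a few lines. The 0.2 threshold is taken from the text; measuring "within 0.2" as the Euclidean distance between the four quaternion components is an assumption, as is the function name.

```python
import math

FLIP_THRESHOLD = 0.2  # from the text; the distance metric is an assumption


def filter_quaternion(q_prev, q_new, threshold=FLIP_THRESHOLD):
    """Return q_new, negated when -q_new lies within threshold of q_prev."""
    negated = tuple(-c for c in q_new)
    if math.dist(q_prev, negated) < threshold:
        return negated
    return tuple(q_new)
```

Since $q$ and $-q$ encode the same rotation, the filter changes only the numerical representation passed to the policy, not the measured pose.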
We tried using an Extended Kalman Filter with the formulation from [monoslam] in order to account for the noise in camera observations. However, we did not find that the performance of our policies on the real robot was noticeably improved compared with policies that did not use the filter, likely due to the high variance of the unknown acceleration in in-hand manipulation.
a.4 Other objects
We experimented with our system to see what the zero-shot transfer performance to different object morphologies was. In order to do this, we swapped the objects in the simulator, ran inference with the O-KP+R-KP policy (see Section 4) that produced the best sim2real transfer results, and measured the success rate. We do not change the keypoints representation, but rather keep the 8 keypoints as if they lie on the original 6.5 cm cube despite the changing object morphology. We were only able to perform these experiments in simulation, as in the remote TriFinger setup we did not have the capacity to swap out objects. However, we are confident that, using an off-the-shelf pose detector (e.g. [tremblay2018corl:dope]), the same system would produce good sim2real transfer results.
We tested a sphere of diameter similar to the cube side length, cuboids of different sizes, and a few objects from the YCB dataset [YCB]. The results are listed in Table 4. Our policy generalises surprisingly well to different object morphologies, for example achieving a success rate of nearly 70% on a mug (depicted in the third panel of Figure 11). However, it struggles with long and skinny objects. This is unsurprising given the difficulty in grasping the cube at less than 0.5x the original scale (or 3 cm).
|Object||Success Rate|
|Cube 6.5cm [Training Object]||92.1%|
|Cuboid 4x6.5x4cm||94.6%|
|YCB Mug (025_mug)||68.8%|
|YCB Banana (011_banana)||28.0%|
|YCB Potted Meat Can (010_potted_meat_can)||81.1%|
|YCB Foam Brick (061_foam_brick)||91.7%|