
Real Robot Challenge using Deep Reinforcement Learning

by Robert McCarthy, et al.

This paper details our winning submission to Phase 1 of the 2021 Real Robot Challenge, a challenge in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach which requires minimal expert knowledge of the robotic system or of robotic grasping in general. A sparse goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the desired x and y coordinates. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the desired z coordinate. The policy is trained in simulation with domain randomization before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best trained policy can successfully lift the real cube along goal trajectories via the use of an effective pinching grasp. Our approach outperforms all other submissions, including those leveraging more traditional robotic control techniques, and is the first learning-based approach to solve this challenge.



1 Real Robot Challenge Phase 1

Dexterous robotic manipulation is applicable in various industrial and domestic settings. However, current state-of-the-art control strategies generally struggle when faced with unstructured tasks which require high degrees of dexterity. Data-driven learning methods are promising for these challenging manipulation tasks, yet related research has been limited by the costly nature of real robot experimentation. In light of these issues, the Real Robot Challenge (RRC) aims to advance the state-of-the-art in robotic manipulation by providing participants with remote access to well maintained robotic platforms, allowing for cheap and easy real robot experimentation. To further support easy experimentation, users are also provided a simulated recreation of the robotic setup.

After an initial RRC qualifying phase, successful participants are entered into Phase 1 where they must solve the challenging ‘Move Cube on Trajectory’ task. In this task, a cube must be carried along a goal trajectory, which specifies the coordinates at which the cube should be positioned at each time-step, using the provided TriFinger robotic platform [3]. For final Phase 1 evaluation, participants submit their developed control policy and receive a score based on how closely it can bring the cube to a number of randomly sampled goal trajectories.

‘Move Cube on Trajectory’ requires a dexterous policy that can adapt to the various cube and goal positions that may be encountered during a single evaluation run. In the 2020 Real Robot Challenge, solutions to this task consisted of structured policies which relied heavily on inductive biases and task-specific engineering [5, 6]. We take an alternative approach by formulating the task as a pure reinforcement learning (RL) problem, using RL to learn our control policy entirely in simulation before transferring it to the real robot for final evaluation. Upon this evaluation, our learned policy outperformed all other competing submissions.

2 Background

Goal-based Reinforcement Learning.

We frame the RRC robotic environments as a goal-based Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{G}, P, \rho_0, r, \gamma)$. $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{G}$ are the state, action, and goal spaces, respectively. The state transition distribution is denoted as $P(s_{t+1} \mid s_t, a_t)$, the initial state distribution as $\rho_0(s_0)$, and the reward function as $r(s_t, a_t, g)$. $\gamma \in [0, 1)$ discounts future rewards. The goal of the RL agent is to find the optimal policy $\pi^*$ that maximizes the expected sum of discounted rewards in this MDP: $\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi} \big[ \sum_{t} \gamma^{t} r(s_t, a_t, g) \big]$.

Hindsight Experience Replay (HER).

HER [2] can be used with any off-policy RL algorithm in goal-based tasks. In these tasks, transition tuples collected in the MDP take the form $(s_t, a_t, r_t, s_{t+1}, g)$, where $g$ is the goal, and the reward function is usually sparse and binary (e.g. equation 1). To improve learning in the sparse reward setting, HER employs a simple trick when sampling previously collected transitions for policy updates: a proportion of sampled transitions have $g$ altered to $g'$, where $g'$ is a goal achieved later in the episode. The rewards of these altered transitions are then recalculated with respect to $g'$, leaving the altered transition tuples as $(s_t, a_t, r'_t, s_{t+1}, g')$. Even if the original episode was unsuccessful, these altered transitions will teach the agent how to achieve $g'$, thus accelerating its acquisition of skills.
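The relabelling trick described above can be sketched in a few lines. This is a minimal illustration of the standard 'future' HER strategy, not the authors' implementation; the number of relabelled copies per transition (`k`) and the tolerance in the sparse reward are illustrative values.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Relabel transitions with future achieved goals (the 'future' HER
    strategy). `episode` is a list of dicts with keys: obs, action,
    next_obs, achieved_goal, goal."""
    relabelled = []
    for t, tr in enumerate(episode):
        # Keep the original transition with its original goal.
        relabelled.append({**tr, "reward": reward_fn(tr["achieved_goal"], tr["goal"])})
        # Add k copies whose goal g' is an achieved goal from a later step;
        # the reward is recalculated with respect to g'.
        for _ in range(k):
            future = random.randint(t, len(episode) - 1)
            new_goal = episode[future]["achieved_goal"]
            relabelled.append({**tr,
                               "goal": new_goal,
                               "reward": reward_fn(tr["achieved_goal"], new_goal)})
    return relabelled

def sparse_reward(achieved, goal, eps=0.02):
    """Sparse, binary reward: 0 if the goal is achieved (within eps), -1 otherwise."""
    return 0.0 if abs(achieved - goal) < eps else -1.0
```

Even when the original episode never reaches `goal`, the relabelled copies carry reward 0 whenever the sampled future achieved goal matches, giving the agent a learning signal.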

3 Method

3.1 Simulated Environment

Actions and Observations.

Pure torque control is used with an action frequency of 20 Hz (i.e. each time-step in the environment is 0.05 seconds). The robot has three arms, with three motorised joints in each arm; thus the action space is 9-dimensional. Observations include: (i) robot joint positions, velocities, and torques; (ii) the provided estimate of the cube’s pose (i.e. its estimated position and orientation), along with the difference between the pose of the current and previous time-step; and (iii) the coordinates at which the cube should currently be placed (i.e. the active goal of the trajectory). In total, the observation space has 44 dimensions.
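The 44-dimensional observation described above decomposes as 27 joint readings, 14 cube-pose values, and 3 goal coordinates. The sketch below assembles such a vector; the function and field names are illustrative, not the challenge API.

```python
import numpy as np

def build_observation(joint_pos, joint_vel, joint_torque,
                      cube_pose, prev_cube_pose, active_goal):
    """Concatenate the observation components into a single vector.
    A pose is position (3) + quaternion orientation (4) = 7 values."""
    pose_diff = cube_pose - prev_cube_pose  # change since previous time-step
    obs = np.concatenate([
        joint_pos,      # 9: one position per motorised joint (3 arms x 3 joints)
        joint_vel,      # 9: joint velocities
        joint_torque,   # 9: joint torques
        cube_pose,      # 7: estimated cube position + orientation
        pose_diff,      # 7: difference between current and previous pose
        active_goal,    # 3: coordinates of the currently active goal
    ])
    assert obs.shape == (44,)
    return obs
```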

Training Episodes.

In each simulated training episode, the robot begins in its default position and the cube is placed randomly on the arena floor. Episodes last for 90 time-steps, with the active goal of the randomly sampled goal trajectory changing every 30 time-steps.

Domain Randomisation.

To aid successful transfer of the learned policy from simulation to the real environment, we use some basic domain randomisation during training [4] (our domain randomization implementation is based on that of last year's benchmark code [5]). This includes uniformly sampling different parameters for the simulation physics (e.g. robot mass, restitution, damping, friction) and for the cube properties (mass and width) each episode. To account for real robot inaccuracies, uncorrelated noise is added to actions and observations within episodes (full implementation details can be found in our code).
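A minimal sketch of this randomisation scheme is shown below. The parameter names and ranges are placeholders for illustration; the actual values follow the benchmark code cited in the text.

```python
import numpy as np

# Illustrative randomisation ranges (NOT the values used in the paper).
PARAM_RANGES = {
    "robot_mass_scale": (0.8, 1.2),
    "restitution":      (0.0, 0.3),
    "joint_damping":    (0.0, 0.5),
    "friction":         (0.5, 1.5),
    "cube_mass":        (0.05, 0.15),   # kg
    "cube_width":       (0.060, 0.070), # m
}

def sample_sim_params(rng):
    """Uniformly sample a fresh set of physics/cube parameters each episode."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def perturb(vec, rng, sigma):
    """Uncorrelated Gaussian noise added to actions/observations within an
    episode, mimicking real-robot inaccuracies."""
    return vec + rng.normal(0.0, sigma, size=vec.shape)
```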

3.2 Learning Algorithm

The goal-based nature of the ‘Move Cube on Trajectory’ task makes HER a natural fit; HER has excelled in similar goal-based robotic tasks [2] and obviates the need for complex reward engineering. As such, we use HER, with Deep Deterministic Policy Gradients (DDPG) [1], as our RL algorithm (our DDPG + HER implementation is taken from …).

Observing in our early experiments that standard DDPG + HER was slow in learning to lift the cube, we resolve this issue by (i) slightly altering the HER process and (ii) incorporating an additional dense reward which encourages the learning of lifting behaviours, as described below.


In our approach, the agent receives two reward components: (i) a sparse reward based on the cube’s x-y coordinates, $r_{xy}$, and (ii) a dense reward based on the cube’s z coordinate (height), $r_z$.

The sparse x-y reward is calculated as:

$$ r_{xy} = \begin{cases} 0 & \text{if } \lVert ag_{xy} - g_{xy} \rVert < \delta \\ -1 & \text{otherwise} \end{cases} \quad (1) $$

where $ag_{xy}$ are the x-y coordinates of the achieved goal (the actual x-y coordinates of the cube), $g_{xy}$ are the x-y coordinates of the desired goal, and $\delta$ is a small distance threshold.

The dense z reward is defined as:

$$ r_z = -\,w \,\lvert ag_z - g_z \rvert \quad (2) $$

where $ag_z$ and $g_z$ are the z-coordinates of the cube and goal, respectively, and $w$ is a parameter which weights $r_z$ relative to $r_{xy}$. $r_z$ teaches the agent to lift the cube by encouraging minimisation of the vertical distance between the cube and the goal. It is less punishing when the cube is above the goal ($ag_z > g_z$), serving to further encourage lifting behaviours.
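The two reward components can be sketched as follows. The tolerance `eps`, the weight `w`, and the factor by which the penalty is reduced above the goal (`above_scale`) are illustrative assumptions, not the paper's values.

```python
import numpy as np

def reward_xy(ag_xy, g_xy, eps=0.02):
    """Sparse x-y reward: 0 within tolerance eps (illustrative), else -1."""
    return 0.0 if np.linalg.norm(np.asarray(ag_xy) - np.asarray(g_xy)) < eps else -1.0

def reward_z(ag_z, g_z, w=1.0, above_scale=0.5):
    """Dense z reward: penalise the vertical distance to the goal, scaled
    by w. The penalty is reduced (by an assumed factor `above_scale`) when
    the cube is above the goal, to further encourage lifting."""
    dist = abs(ag_z - g_z)
    scale = above_scale if ag_z > g_z else 1.0
    return -w * scale * dist

def total_reward(ag, g):
    """Combined reward for a 3D achieved goal `ag` and desired goal `g`."""
    return reward_xy(ag[:2], g[:2]) + reward_z(ag[2], g[2])
```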

We only apply HER to the x-y coordinates of the goal, i.e., the x-y coordinates of the goal can be altered in hindsight but the z coordinate always remains unchanged: $g' = (g'_{xy}, g_z)$. Thus, only $r_{xy}$ is recalculated after HER is applied to a transition sampled for policy updates. This reward system is motivated by the following:

  • Using $r_{xy}$ with HER allows the agent to easily learn to push the cube around in the early stages of training (without requiring any complicated reward engineering), even if it cannot yet pick the cube up to reach the z-coordinate of the goal. As the agent learns to push the cube around the x-y plane of the arena floor, it can then more easily stumble upon actions which lift it.

  • In the early stages of training the cube mostly remains on the floor. During these stages, most goals $g'$ sampled by normal HER will be on the floor, and so the agent can often be punished by $r_z$ for briefly lifting the cube. Since we only apply HER to the x-y coordinates of the goal, our HER-altered goals, $g' = (g'_{xy}, g_z)$, maintain their original z height, leaving more room for the agent to be rewarded by $r_z$ for any cube lifting it performs.
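The x-y-only relabelling described above amounts to replacing only the first two components of the goal. A minimal sketch (goal layout `[x, y, z]` assumed):

```python
import numpy as np

def relabel_goal_xy_only(goal, achieved_goal):
    """Hindsight relabelling restricted to the x-y plane: take the x-y
    coordinates from a future achieved goal but keep the original z,
    so only the sparse x-y reward needs recalculating."""
    new_goal = np.array(goal, dtype=float)
    new_goal[:2] = np.asarray(achieved_goal, dtype=float)[:2]  # replace x-y only
    return new_goal
```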

Goal Trajectories.

In each episode, the agent is faced with multiple goals; it must move the cube from one goal to the next along a given trajectory. To ensure the HER process remains meaningful in these multi-goal episodes, we only sample future achieved goals, $ag$ (to replace $g$), from the period of time in which $g$ was active.

In our implementation, the agent is not aware that it is dealing with trajectories; when updating the policy with transitions we always set $g_{t+1} = g_t$, even if in reality $g_{t+1}$ was different. (Interestingly, we found that exposing the agent during updates to transitions in which $g_{t+1} \ne g_t$ hurt performance significantly, perhaps due to the extra uncertainty this introduces to Q-function estimates.) Thus, the policy focuses solely on achieving the current active goal and is unconcerned about the possibility of the active goal changing.
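Restricting hindsight goals to the window in which the original goal was active can be sketched as below, assuming (per Section 3.1) that the active goal changes every 30 time-steps; the episode layout is illustrative.

```python
import random

def sample_hindsight_goal(episode, t, goal_period=30):
    """Sample a future achieved goal only from the window in which the
    goal active at step t was in effect (goals switch every `goal_period`
    steps), so the relabelled transition stays meaningful."""
    window_end = ((t // goal_period) + 1) * goal_period
    window_end = min(window_end, len(episode))
    future = random.randint(t, window_end - 1)  # inclusive bounds
    return episode[future]["achieved_goal"]
```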

Exploration vs Exploitation.

We base our DDPG + HER hyperparameters on Plappert et al., who use a highly exploratory policy when collecting data in the environment: with probability 30% a random action is sampled (uniformly) from the action-space, and when policy actions are chosen, Gaussian noise is applied. This is beneficial for exploration in the early stages of training; however, it can be limiting in the later stages when the policy must be fine-tuned. We found that the exploratory policy repeatedly drops the cube due to the randomly sampled actions and injected action noise. Rather than slowly reducing the level of exploration each epoch, we collect 90% of rollouts with the exploratory policy and the remaining 10% with the standard exploiting policy. This addition was sufficient to boost final performance.

4 Results

4.1 Simulation

Our method is highly effective in simulation. The algorithm can learn from scratch to proficiently grasp the cube and lift it along goal trajectories. Figure 2 (a) compares the training performance of our final method to that of standard HER (these runs did not use domain randomization; generally, we trained from scratch in standard simulation before fine-tuning in a domain-randomized simulation). Throughout different training runs, our policies learned several different manipulation strategies, the most distinct of which included: (i) ‘pinching’ the cube with two arm tips and occasionally supporting it with the third, and (ii) ‘cradling’ the cube with all three of its forearms (see Figure 1).

(a) Simulated training
(b) Final leaderboard
Figure 2: (a): Success rate vs experience collected. We compare training with (i) HER applied to a standard sparse reward (blue), (ii) HER applied to both $r_{xy}$ and $r_z$ (orange), and (iii) our final method, where HER is applied to $r_{xy}$ but not $r_z$. An episode is deemed successful if, when complete, the final goal of the trajectory has been achieved. (b): The official leaderboard after final Phase 1 evaluation on the real robot. Our team name was ‘thriftysnipe’.

4.2 Real Robot

Our final policies transferred to the real robot with reasonable success. Table 1 displays the self-reported scores of our best pinching and cradling policies under RRC Phase 1 evaluation conditions. As a baseline comparison, we trained a simple ‘pushing’ policy which ignores the height component of the goal and simply learns to push the cube to the goal’s x-y coordinates. The pinching policy performed best on the real robot, and is capable of carrying the cube along goal trajectories for extended periods of time and recovering the cube when it is dropped. This policy was submitted for the official RRC Phase 1 final evaluation and obtained the winning score seen in Figure 2 (b).

The domain gap between simulation and reality was significant, and generally led to inferior scores on the real robot. Policies often struggled to gain control of the real cube which appeared to slide more freely than in simulation. Additionally, policies could occasionally become stuck with an arm-tip pressing the cube into the wall. As a makeshift solution to this issue, we assumed the policy was stuck whenever the cube had not reached the goal’s x-y coordinates for 50 consecutive steps, then uniformly sampled random actions for seven steps in an attempt to ‘free’ the policy from its stuck state.
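The stuck-recovery heuristic described above can be sketched as a small stateful wrapper around the policy. The distance tolerance and action limits are illustrative assumptions; the 50-step patience and 7 random steps come from the text.

```python
import numpy as np

class StuckRecovery:
    """If the cube hasn't reached the goal's x-y coordinates for `patience`
    consecutive steps, emit `n_random` uniformly random actions to 'free'
    the policy from its stuck state. `tol` is an assumed tolerance."""

    def __init__(self, patience=50, n_random=7, tol=0.02, act_dim=9):
        self.patience, self.n_random = patience, n_random
        self.tol, self.act_dim = tol, act_dim
        self.miss_count, self.random_left = 0, 0

    def action(self, policy_action, cube_xy, goal_xy, rng):
        # Track how long the cube has been away from the goal's x-y position.
        if np.linalg.norm(np.asarray(cube_xy) - np.asarray(goal_xy)) > self.tol:
            self.miss_count += 1
        else:
            self.miss_count = 0
        # Trigger a burst of random actions once patience is exhausted.
        if self.miss_count >= self.patience:
            self.miss_count = 0
            self.random_left = self.n_random
        if self.random_left > 0:
            self.random_left -= 1
            return rng.uniform(-1.0, 1.0, size=self.act_dim)
        return policy_action
```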

                Pushing            Cradling           Pinching
Simulation      -20,399 ± 3,799    -6,349 ± 1,039     -6,198 ± 1,840
Real robot      -22,137 ± 3,671    -14,207 ± 2,160    -11,489 ± 3,790
Table 1: Self-reported evaluation scores of our learned pushing, cradling, and pinching policies upon deployment on the simulated and real robots (mean ± standard deviation over 10 episodes).

5 Discussion

Our relatively simple reinforcement learning approach fully solves the ‘Move Cube on Trajectory’ task in simulation. Moreover, our learned policies can successfully implement their sophisticated manipulation strategies on the real robot. Unlike last year's benchmark solutions [5], this was achieved with the use of minimal domain-specific knowledge. We outperform all competing submissions, including those employing more classical robotic control techniques.

Due to the large domain gap, the excellent performances in simulation could not be fully matched upon transfer to the real robot. Indeed, the main limitation of our approach was the absence of any training on real robot data. It is likely that some fine-tuning of the policy on real data would greatly increase its robustness in the real environment, and developing a technique which could do so efficiently is one direction for future work. Similarly, domain adaptation techniques could be employed to produce a policy more capable of adapting to the real environment [8, 9]. However, ideally the policy could be learned from scratch on the real system; a suitable simulator may not always be available. Although our results in simulation were positive, the algorithm is still somewhat sample inefficient, requiring nearly one week of experience to converge. Thus, another important direction for future work would be to reduce sample complexity so as to increase the feasibility of real robot training, perhaps achievable via a model-based reinforcement learning approach [10, 11].



Acknowledgements

This publication has emanated from research supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2, co-funded by the European Regional Development Fund, and by a Science Foundation Ireland Future Research Leaders Award (17/FRL/4832).