Flexible Robotic Grasping with Sim-to-Real Transfer based Reinforcement Learning

03/13/2018 ∙ by Michel Breyer, et al. ∙ ETH Zurich

Robotic manipulation requires a highly flexible and compliant system. Task-specific heuristics usually cannot cope with the diversity of the world outside of specific assembly lines and do not generalize well. Reinforcement learning methods provide a way to cope with uncertainty and allow robots to explore their action space to solve specific tasks. However, this comes at the cost of long training times, useful actions that are sparse and therefore hard to sample, strong local minima, etc. In this paper we show a real robotic system, trained in simulation on a pick-and-lift task, that is able to cope with different objects. We introduce an adaptive learning mechanism that allows the algorithm to find feasible solutions even for tasks that would otherwise be intractable. Furthermore, in order to improve the performance on difficult objects, we use a prioritized sampling scheme. We validate the efficacy of our approach with a real robot in a pick-and-lift task with different objects.




I Introduction

In order for robots to assist us in everyday tasks, they need to be able to explore and interact with unstructured and dynamic environments found outside of traditional assembly lines and research labs. Robust manipulation of objects is a key component in all robotic applications that require interaction with their surroundings. To perform a task, the robot needs to be able to perceive the environment through its sensors, plan and execute the next action, and at the same time handle noisy data, external disturbances and real-world uncertainties. These challenges, together with the interdisciplinary nature of the problem, make this research area very complex. Traditional approaches usually use methods from computer vision to interpret the sensor data and then use an analytic approach (policy) to plan the next action. Manually designing policies that can cope with the complexity of the high-dimensional sensory input is difficult and often results in solutions that are highly tailored to a given problem and fragile to changes in the setup or task definition. Data-driven approaches, however, have been shown to be powerful in such cases, given enough experience.

Reinforcement learning (RL) is a general framework for training agents to acquire desired skills through trial and error by providing a reward for successful executions. It is able to find a complex mapping from a high-dimensional input space to the desired actions, without the need to explicitly model this relationship. However, depending on the complexity of the task, a large amount of data might be required to learn the desired behaviour. In particular, the exploration phase can take a long time in the presence of large and continuous state and action spaces. To reduce the training time, the complexity of the manipulation task can be incrementally increased, thus allowing the learning algorithm to converge faster at each step. Simulators can be used as a less expensive and faster alternative to real-world data collection. However, transferring policies from simulation to the real world presents numerous difficulties, such as sensor noise, contact physics approximations and simulated friction inaccuracies, that can influence the final result on the real system.

Fig. 1: We consider the problem of learning closed-loop policies for the combined task of reaching, grasping and lifting objects. The policies map images captured by a wrist-mounted depth camera to end effector motion and gripper opening and closing actions. We compare different approaches to improve the efficiency of training and performance of the final controllers, including reward function shaping, designing a curriculum of tasks with increasing difficulty, as well as using a partially scripted policy to provide a warm start for the full problem. Policies are trained in simulation and evaluated on a set of unseen objects, both in simulation and real-world experiments.

In this work, we explore RL approaches to train agents that interact with their environment in a fully closed-loop manner in order to maximize future reward. We learn policies for the full task of reaching, grasping and lifting, which map depth images captured from a wrist-mounted camera to end effector displacements and gripper actions of a robotic arm, without relying on heuristics for the grasp decision. We explore different mechanisms to reduce training costs. First, we separate perception and control by learning a compressed image representation using the latent space of an autoencoder. Second, following the methodology of curriculum learning [1], we guide the training of our models by progressively increasing the workspace as the agent's performance improves, and compare this method against reward function shaping and against using a heuristic to bootstrap the full problem. Finally, the entire training is performed in simulation and we report the required adjustments and findings from transferring policies to a real platform. In summary, the contributions of this paper are:

  • A closed-loop end-to-end formulation for the combined task of reaching, grasping and lifting different objects.

  • A case study of applying curriculum learning to guide training on this challenging task, including a comparison to alternative approaches.

  • A presentation of findings from transferring policies learned exclusively in simulation to a real-world table clearing task.

II Related Work

The task of reaching, grasping, and lifting can be solved using a large variety of approaches. We first present a selection of work that we consider representative of general methods for this problem. We then highlight a selection of RL formulations, their robotic applications, and how they are used to partially or completely solve this manipulation task.

Grasp planning considers the problem of detecting grasp candidates that maximize the probability of success for a given environment and gripper configuration. Early approaches relied on geometric reasoning, often assuming knowledge of the shape and physical properties of the involved objects [2, 3]. Data-driven approaches, on the other hand, aim at learning models from labeled data that can exploit visual cues and generalize to unseen objects [4]. Lenz et al. [5] trained a deep neural network on a small set of human-labeled images to predict the success of grasps on novel images. A different approach is to exploit analytic grasp theory to generate labeled data from synthetic point clouds [6, 7, 8], while a third line of research learns models in a self-supervised manner directly from physical trials [9, 10]. Compared to the first two classes, the self-supervised approaches do not require any prior knowledge of grasp theory or human-labeled samples. While some of the mentioned works improve robustness by iteratively recomputing the best grasp configuration [10, 8], they do not consider the long-term consequences of actions required to learn more complex behaviors. To enable an agent to learn such action sequences, one can pose this task as an RL problem.

Reinforcement learning [11, 12] is a general framework that considers autonomous agents who learn to choose sequences of control decisions that maximize some long-term measure of reward. To tackle the high-dimensional and continuous problems typically found in robotic applications, early works relied on task-specific, hand-engineered policy representations [13, 14]. Combining RL with the expressive power of deep neural networks has led to some impressive results in various complex decision-making problems [15, 16]. Due to the high data requirements, popular benchmarks often focus on video games [15] and simulated control problems [17, 18]. However, a number of works have applied RL to real-world manipulation tasks. One of the most notable is Guided Policy Search [19], which trains a large neural network policy in a supervised manner on samples collected with trajectory-based RL. Other works tackle individual skills, such as opening doors [20] or lifting and stacking blocks [21]. Our problem formulation is closest to Quillen et al. [22], who compared different off-policy RL algorithms for bin-picking in clutter with a large set of training and unseen test objects. This work was extended [23] to include gripper actions and a decision variable on when to terminate an episode. In the last two approaches, the training data is generated from a scripted policy. In contrast to these approaches, in addition to a heuristic policy initialization, we explore curriculum learning to make the problem tractable.

Sparse reward formulations are naturally suited for many goal-oriented manipulation tasks, but also create challenges, leading to techniques such as augmenting reinforcement signals through reward shaping [20, 21], learning from expert demonstrations [13, 24, 25] and curriculum learning [1]. The latter proposes to guide learning by presenting training samples in a meaningful order with increasing complexity and has been applied to supervised learning for sequence prediction [26] and to RL to acquire a curriculum of motor skills of an articulated figure [27]. Akin to curriculum learning, Popov et al. [21] sample initial states along expert trajectories. Recent related work proposed to use a generative adversarial network (GAN) to automatically generate goals of increasing difficulty [28], to generate start state distributions that gradually expand from a given goal state [29] and to train a teacher to automatically choose samples for the learner [30]. In our work, due to the large diversity of objects, goal states are not easily available. Therefore, our curriculum schedule increases both the space from which initial states are sampled and the final lifting height at which the target reward is awarded.

As opposed to using large-scale data collection on real robots [9, 10, 23], we perform training in simulation as a less expensive, faster, and safer alternative. However, deploying policies learned in simulation on a real system requires bridging the reality gap induced by differences in sensing and dynamics, which is a very active field of research. One approach is to close the gap by making the simulation match the real system as closely as possible through system identification [31]. Another approach is to expose the learning agent to a range of different environments through domain randomization, forcing it to learn a robust representation that generalizes to the real world [32, 33]. Lastly, models can be adapted to new domains, e.g. by using progressive networks [34], learning correspondences using a pairwise loss function [35] or using generative adversarial networks to map simulated images to realistic-looking ones [36]. In this work, similarly to [37] and [8], we explore directly transferring trained models to the real world with only small modifications. In contrast to their approaches, our policies also need to predict height displacements and grasp decisions, which was found to be challenging.

III Approach

III-A Problem Formulation

We consider the combined task of reaching, grasping and lifting objects using a parallel-jaw gripper and a wrist-mounted depth camera. Our goal is to find a closed-loop policy through model-free RL that maps sensor measurements to end effector displacement and gripper controls. The input contains visual information captured from the depth camera and the current gripper opening width. To keep model sizes of our policy small, we first learn a lower-dimensional encoding of the depth images, which is then concatenated with the gripper width. This process is described in more detail in Section III-E. The 5-dimensional, continuous action vector includes xyz-translation and yaw rotation of the robotic hand as well as the gripper opening width. The movement of the hand is performed relative to the gripper's frame, complementing our wrist-mounted camera setup. The translation vector is clipped to a maximum length per step, requiring many iterations to finish the task and allowing the agent to react to dynamic changes in the environment. The gripper width command is interpreted as a binary decision, with negative/positive values being mapped to a closed/opened hand respectively. In the remainder of this section, we present the RL training process, the agent model and the different approaches we explored for speeding up training, including reward shaping, curriculum learning and transfer learning.
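As a rough illustration of how such an action vector might be post-processed before being sent to the robot, the sketch below clips the translation part to a maximum length and maps the sign of the gripper command to an open/close decision. The function name `process_action`, the constant `MAX_STEP` and its value are assumptions made for this example; the actual clipping length is not recoverable from the text.

```python
import math

# Hypothetical clipping length per step (metres); the value used in the
# paper is not recoverable from the text, so this number is an assumption.
MAX_STEP = 0.01


def process_action(action, max_step=MAX_STEP):
    """Turn a raw 5-D action into a clipped translation, a yaw change and a gripper decision."""
    dx, dy, dz, dyaw, width_cmd = action
    # Clip the translation vector to a maximum Euclidean length.
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm > max_step:
        scale = max_step / norm
        dx, dy, dz = dx * scale, dy * scale, dz * scale
    # Positive command -> opened hand, negative command -> closed hand.
    # Treating exactly zero as "closed" is an arbitrary choice made here.
    gripper_open = width_cmd > 0.0
    return (dx, dy, dz), dyaw, gripper_open


translation, dyaw, gripper_open = process_action([0.03, 0.0, -0.04, 0.2, -0.5])
```

Because each step moves the hand by at most `max_step`, an episode consists of many small closed-loop corrections rather than one open-loop motion.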

III-B Reinforcement Learning

III-B1 Background

Following [11], we model our RL problem as a discrete-time, finite-horizon Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, r, \rho_0, P)$, where $\mathcal{S}$ denotes the set of admissible states, $\mathcal{A}$ the set of valid actions, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ a real-valued reward function, $\rho_0$ the initial state distribution and $P$ the (unknown) transition probability distribution. At each time step $t$, an RL agent observes the current state $s_t$ of its environment and decides to take an action $a_t$ according to a parameterized policy $\pi_\theta(a_t \mid s_t)$. The execution of this action causes the system to transition to a new state $s_{t+1}$ according to the system dynamics $P(s_{t+1} \mid s_t, a_t)$ and the agent receives a reward $r_t = r(s_t, a_t)$. Episodes are terminated after a fixed number of steps $T$ or once a defined terminal state is reached. The goal of RL is to find parameters $\theta$ that maximize the return $\mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $\gamma$ denotes a discount factor and the expectation is computed over the distribution of all possible trajectories $\tau$ with probabilities $p_\theta(\tau)$.
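To make the objective concrete, the following minimal sketch computes the discounted return of a single episode and a Monte Carlo estimate of its expectation over sampled episodes. The function names and the example numbers are illustrative only and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, accumulated from the last step backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def expected_return(episodes, gamma):
    """Monte Carlo estimate: average discounted return over sampled episodes."""
    return sum(discounted_return(ep, gamma) for ep in episodes) / len(episodes)


single = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
estimate = expected_return([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], gamma=0.5)
```

Policy gradient methods adjust the policy parameters so that an estimate of this kind increases over the course of training.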

III-B2 Training Process

In our task of object picking, at the beginning of each episode, we sample from the initial state distribution by randomly selecting objects and placing them at a random pose within a workspace of size on a flat surface, where is the extent of the workspace. The number of objects is uniformly chosen between and for every new episode and the robot hand is placed pointing downwards at the center of the workspace with a distance between its finger tips and the surface.

We consider the outcome of an episode as a success and terminate if, within the time horizon of control steps, any object was lifted for . A natural reward function of this task would be a binary in case of success and otherwise. Such sparse rewards are difficult to learn from, requiring significant exploration. For this reason, in order to guide training, we also consider an alternative shaped reward formulation, in which the agent additionally receives intermediate reward signals for lifting objects,


with , , , and the difference in the robot’s height since the last step. The first term in the equation is a binary function that returns if a grasp was detected and otherwise. Grasp detection is achieved by checking if the fingers stalled after a closing command was issued. We also include a time penalty of and for the sparse and shaped reward functions respectively, where is the maximum allowed change in height per step. The latter is chosen such that rewards are shifted to negative values encouraging the agent to complete the task as quickly as possible.
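Since the reward equation and its coefficients did not survive in the text above, the sketch below only illustrates the general structure it describes: a sparse success reward, and a shaped variant that adds a grasp indicator term, a term proportional to the height change, and a time penalty. The exact form, the 0/1 values of the sparse reward, all coefficient values and the helper `shifted_penalty` are assumptions made for this example, not the paper's definition.

```python
def sparse_reward(success, time_penalty=0.0):
    """Assumed binary reward (1 on success, 0 otherwise) minus a per-step time penalty."""
    return (1.0 if success else 0.0) - time_penalty


def shaped_reward(success, grasp_detected, delta_z,
                  c_grasp, c_lift, time_penalty, target_reward=1.0):
    """Assumed shaped reward: target reward on success, otherwise intermediate
    signals for holding and lifting an object, minus a time penalty."""
    if success:
        return target_reward
    reward = 0.0
    if grasp_detected:
        # Binary grasp term plus a term proportional to the height gained this step.
        reward += c_grasp + c_lift * delta_z
    return reward - time_penalty


def shifted_penalty(c_grasp, c_lift, dz_max):
    """One possible penalty that keeps every intermediate reward non-positive,
    given a maximum allowed height change dz_max per step."""
    return c_grasp + c_lift * dz_max


# Hypothetical coefficients, chosen only to exercise the functions.
penalty = shifted_penalty(c_grasp=1.0, c_lift=10.0, dz_max=0.01)
best_step = shaped_reward(False, True, 0.01, 1.0, 10.0, penalty)
idle_step = shaped_reward(False, False, 0.0, 1.0, 10.0, penalty)
```

With a penalty of this kind, every non-terminal step yields a reward of at most zero, which is one way to encourage the agent to finish the task quickly.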

III-C Workspace Curriculum

Limited prior knowledge allows model-free RL to be applied to a large set of tasks, but also renders exploration of interesting parts of the state space challenging. Particularly in manipulation tasks with large workspaces, the agent might waste significant training time exploring free space away from the objects. For this reason, following the formalism of Bengio et al. [1], we propose a curriculum of workspaces with increasing sizes to guide training of our agents.

Consider a sequence of training distributions in which the extent of the workspace, the initial robot height, the target lift distance, and the maximum number of objects each increase linearly within a defined range as a function of a curriculum variable. The smallest value of this variable is mapped to the minimum of each parameter and the largest value to its maximum. The maximum number of objects is rounded to the nearest integer. The curriculum step is increased step-wise each time the success rate, averaged over a window of recent episodes, reaches a given threshold. This ensures that the agent explores the state space close to the objects of interest in early stages of training, while allowing the task to scale to large workspaces in later stages.
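The curriculum mechanism described above can be sketched as a small scheduler that interpolates each parameter linearly and advances one step whenever the windowed success rate reaches a threshold. The class name, the parameter names, all numeric ranges, and the decision to clear the window after each step increase are assumptions made for this illustration.

```python
from collections import deque


class WorkspaceCurriculum:
    """Linearly interpolated task parameters, advanced by a success-rate threshold."""

    def __init__(self, ranges, num_steps, threshold, window):
        self.ranges = ranges          # parameter name -> (min value, max value)
        self.num_steps = num_steps    # number of curriculum steps
        self.threshold = threshold    # success rate required to advance
        self.outcomes = deque(maxlen=window)
        self.step = 0

    def parameters(self):
        """Current parameter values, interpolated between their min and max."""
        lam = self.step / (self.num_steps - 1)
        params = {k: lo + lam * (hi - lo) for k, (lo, hi) in self.ranges.items()}
        if "max_objects" in params:
            # The object count must be an integer.
            params["max_objects"] = int(round(params["max_objects"]))
        return params

    def record(self, success):
        """Store an episode outcome and advance once the window is full and good enough."""
        self.outcomes.append(bool(success))
        full = len(self.outcomes) == self.outcomes.maxlen
        rate = sum(self.outcomes) / len(self.outcomes)
        if full and rate >= self.threshold and self.step < self.num_steps - 1:
            self.step += 1
            self.outcomes.clear()  # assumption: restart the window at each new step


# Hypothetical ranges; only the structure follows the description above.
curriculum = WorkspaceCurriculum(
    {"extent": (0.1, 0.3), "max_objects": (3, 5)},
    num_steps=8, threshold=0.7, window=10)
first = curriculum.parameters()
for _ in range(100):
    curriculum.record(True)
last = curriculum.parameters()
```

Early steps keep the workspace small so that random exploration frequently touches an object, while the final step corresponds to the full task.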

III-D Transfer Learning

For comparison, we also consider agents pre-trained on a simplified task formulation, similar to [22], that includes a heuristic to guide training. Robot arm movements are restricted to xy-translation and yaw rotation, with the z component fixed to a constant downward movement. Furthermore, the grasp decision is replaced by a heuristic that attempts a grasp once a given height threshold is reached. The reward function for this task is binary and indicates whether an object was successfully lifted. Removing two degrees of freedom (DoF) significantly decreases the complexity of the task, but also limits the behavior of the learned policies. State-action pairs collected by executing an agent trained on this task can be augmented to be compatible with the original action description. Given this data, we use behavioral cloning (BC) to train a policy predicting the full action space. The weights of this policy provide a warm start for further fine-tuning through RL.
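One way to picture the augmentation step is shown below: each 3-D action of the simplified agent is padded with the constant downward motion and with a gripper command derived from the height heuristic, yielding 5-D targets for behavioral cloning. The function names, the tuple layout of a transition, and the numeric values are hypothetical, and the lifting phase after a grasp is left out for brevity.

```python
def augment_action(simple_action, height, grasp_height, descent_step):
    """Expand a simplified (dx, dy, dyaw) action to (dx, dy, dz, dyaw, width_cmd)."""
    dx, dy, dyaw = simple_action
    # The heuristic keeps the hand open while descending and closes it
    # once the height threshold is reached.
    width_cmd = 1.0 if height > grasp_height else -1.0
    return [dx, dy, -descent_step, dyaw, width_cmd]


def augment_dataset(transitions, grasp_height, descent_step):
    """Build (observation, full_action) pairs usable as supervised targets."""
    return [
        (obs, augment_action(action, height, grasp_height, descent_step))
        for obs, action, height in transitions
    ]


# Two hypothetical transitions: (observation, simplified action, hand height).
data = augment_dataset(
    [("obs_0", (0.01, -0.02, 0.1), 0.20),
     ("obs_1", (0.00, 0.00, 0.0), 0.04)],
    grasp_height=0.05, descent_step=0.005)
```

A policy fitted to such pairs reproduces the heuristic behavior in the full action space and can then be fine-tuned with RL.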

III-E Agent Model

We separate the visual sensory and decision-making components of our agents. In a first stage, a perception network is trained to map image observations to a small-dimensional latent vector in an unsupervised manner. This network is then kept fixed and used to train a smaller policy to maximize the reward function described in III-B. Details of the different network architectures are depicted in Figure 2.

Fig. 2: Overview of the two networks used in this work. At each time step, the agent receives an image of its environment. The perception network generates a low-dimensional encoding, which is then concatenated with the current opening width of the gripper and passed to the control network to determine the next action. The former is trained in an unsupervised manner and kept fixed throughout training of the control policy.

Perception Module

The goal of the perception network is to encode information about the shape, scale and distance of the objects in the scene into a low-dimensional latent vector. In this work, we use a simple autoencoder. The encoder consists of 3 convolutional layers followed by a fully-connected layer using leaky ReLU non-linearities. The decoder mirrors the architecture of the encoder to reproduce a full-sized image. Using a low-dimensional bottleneck and training the parameters to minimize the L2 distance between the original and reconstructed images forces the encoder to learn a compressed representation of the input. The training set was collected by running a random policy on the simplified task described in the previous section. Since they are not relevant for our task, we filtered out the plane and gripper fingers from the images. Figure 3 shows two examples of original, filtered, reconstructed and error images using an encoder trained on a dataset of 50000 images for 120 epochs with the Adam optimizer [38]. The same encoder weights were used throughout all experiments in this work.

Fig. 3: Two samples of images processed by our perception pipeline. From left to right, we show simulated RGB and filtered depth images, reconstructions produced by the autoencoder and the difference between originals and reconstructions.

Control Module

We use a small network that is trained separately from the perception network to map encoded observations to optimal actions. Policies are modeled as multivariate Gaussian distributions. A feed-forward neural network with two hidden layers and ReLU activations maps observations to the means of the distribution, while the log-standard deviations are parameterized by a global, trainable vector. Actions are normalized to a bounded range using an output non-linearity. Policy weights are optimized using Trust Region Policy Optimization (TRPO) [39], a policy gradient method that performs stable updates by enforcing a constraint on the maximum change in policy distributions between two updates.
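The following sketch illustrates the policy parameterization in isolation: a diagonal Gaussian whose means would come from the control network and whose log-standard deviations form a state-independent vector. It also includes the KL divergence between two such distributions, the quantity TRPO bounds between updates. The network itself is omitted, and the function names and example values are assumptions made for illustration.

```python
import math
import random


def sample_action(mean, log_std, rng):
    """Draw one action from a diagonal Gaussian with the given means and log-stds."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


def log_prob(action, mean, log_std):
    """Log-density of an action under the diagonal Gaussian policy."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def kl_divergence(mean_p, log_std_p, mean_q, log_std_q):
    """KL(p || q) between two diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for mp, lp, mq, lq in zip(mean_p, log_std_p, mean_q, log_std_q):
        var_p, var_q = math.exp(2.0 * lp), math.exp(2.0 * lq)
        kl += lq - lp + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return kl


rng = random.Random(0)
mean = [0.0] * 5        # stand-in for the output of the control network
log_std = [0.0] * 5     # global, trainable vector (here: unit variance)
action = sample_action(mean, log_std, rng)
```

Keeping the log-standard deviations independent of the state means that the amount of exploration noise is shared across all observations and is learned together with the network weights.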

III-F Simulation

Collecting data using a dynamic simulation and synthetic depth images instead of a real system has several advantages: it is faster, scales better, has lower cost, there is no need for supervision, automatic reset of experiments is easy to implement, and full state information is available. For this reason, we focused on performing all training in simulation. We constructed a virtual world using the Bullet physics engine [40] and added a disentangled robot hand whose position is controlled via a force constraint, avoiding the computation of inverse kinematics. A virtual camera rendering images was placed to match the viewpoint of the real setup. Depth images were generated using a software-renderer bundled with the physics engine and filtering was performed using masks provided by the engine.

III-G Transfer to the Real Platform

We explore transferring policies trained in simulation to the real world without any fine-tuning of the network weights. Ideally, images captured from the real camera would only need to be resized and cropped to match the dimensions of the simulated camera and then be passed into the encoder. However, due to imperfect data and high noise levels, especially at the operating boundaries of the real sensor, some additional filtering was required. In particular, we noticed increasing noise and some curvature towards image boundaries, as well as high noise around the gripper's fingers. For this reason, we applied an additional elliptic mask to filter out the borders and dilated masks of the gripper's fingers. The surface was detected and filtered using a RANSAC-based [41] approach.
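The masking steps can be illustrated on a tiny image stored as nested lists. The sketch builds an elliptic mask that keeps the image centre, dilates a finger mask by one pixel, and blanks every pixel that falls outside the ellipse or inside the dilated fingers. Function names, the 3x3 dilation neighbourhood, the fill value and the image size are assumptions; the RANSAC-based surface removal is not shown.

```python
def elliptic_mask(height, width):
    """True for pixels inside the ellipse inscribed in the image, False near the borders."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ry, rx = height / 2.0, width / 2.0
    return [[((r - cy) / ry) ** 2 + ((c - cx) / rx) ** 2 <= 1.0
             for c in range(width)] for r in range(height)]


def dilate(mask, iterations=1):
    """Grow the True region of a binary mask by one pixel (3x3 neighbourhood) per iteration."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        mask = [[any(mask[rr][cc]
                     for rr in range(max(0, r - 1), min(h, r + 2))
                     for cc in range(max(0, c - 1), min(w, c + 2)))
                 for c in range(w)] for r in range(h)]
    return mask


def apply_masks(depth, keep_mask, finger_mask, fill=0.0):
    """Keep depth values inside keep_mask and outside finger_mask; blank the rest."""
    return [[d if keep and not finger else fill
             for d, keep, finger in zip(d_row, k_row, f_row)]
            for d_row, k_row, f_row in zip(depth, keep_mask, finger_mask)]


depth = [[1.0] * 5 for _ in range(5)]
fingers = [[r == 2 and c == 2 for c in range(5)] for r in range(5)]
filtered = apply_masks(depth, elliptic_mask(5, 5), dilate(fingers))
```

Dilating the finger mask removes a small margin around the fingers as well, which is where the text reports the highest noise.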

IV Evaluation

The goal of our experiments is to evaluate and compare training times and final performance of the proposed models, as well as assess their capability to react to dynamic changes and transfer to the real world.

IV-A Experimental Setup

The platform used for evaluation consists of a position-controlled 7-DoF arm of an ABB Yumi. The fingers of the stock gripper were covered with rubber for better grip and reduced reflection. A CamBoard pico flexx time-of-flight camera was attached to the wrist of the robot at a tilt, as seen in the top right image of Figure 1. In simulation, we used a model that matches the real robot and stepped the dynamics simulation with a step size that provided plausible physical behavior. Training was performed on a set of procedurally generated random objects with diverse shapes (https://sites.google.com/site/brainrobotdata/home/models). Following [22], we split the dataset into 900 train and 100 test models, and the objects were scaled to fit into the smaller gripper used in this work. The grasping task was implemented on top of the OpenAI gym interface [42] and we based our implementation of TRPO on Rllab [18]. Policy iterations were performed with a fixed step size and with separate batch sizes for the simplified and full task descriptions. A curriculum of eight sets of workspace parameters was used with values linearly increasing in the ranges reported in Table I. The curriculum step is increased once a threshold success rate averaged over the last 1000 episodes is reached during training.

Parameter Min. value Max. value
3 5
TABLE I: Parameter ranges for the curriculum sequences.

IV-B Simulated Experiments

IV-B1 Model comparison

Fig. 4: Learning curves for the different models analyzed in this work. Using a curriculum significantly speeds up training and leads to high final success rates that are comparable to the performance of an agent which was trained with a warm start provided by a grasping heuristic with fewer DoF.

We analyze learning curves and the final performance of models trained on the full problem with only shaped rewards (shaped), and using the proposed curriculum with both shaped and sparse reward formulations (shaped/sparse, curriculum). We also include agents trained on the simplified task (sparse, simplified), the BC (sparse, BC) and warm-started policies (sparse, warm-start) described in Section III-D. Figure 4 shows success rates of training iterations against the number of environment interactions. For each model, we performed experiments with five different seeds and report the median, as well as the worst and best run, depicted as solid lines and shaded areas respectively. Surprisingly, even when training on the full workspace from the start, the algorithm manages to reinforce the occasional intermediate reward provided when agents interact with objects. However, results varied strongly across seeds, with three out of the five runs failing altogether. Using a curriculum significantly speeds up learning and improves the final performance of the agents. We observe that the difference between the learning curves using the shaped and sparse rewards is quite small. This confirms that providing easily reachable goal states at early stages of training acts as a means of guiding the agent and speeding up the training process without artificially shaping the reward function. Note that both of these models seem to stagnate temporarily for a stretch of training steps. This is due to the agents repeatedly reaching the threshold success rate, triggering an increase in the difficulty of the task until the curriculum parameters are set to their maximum values. This behavior is depicted in more detail in Figure 5(c), which plots the history of the curriculum step alongside the history of success rates.

We can also observe that replacing two DoF with a heuristic (sparse, simplified) results in a task that is considerably easier and faster to learn. However, policies seem to converge to a lower success rate. We found that a large fraction of these failure cases are due to a collision check, implemented to avoid the agent wasting time in case the robot stalled before reaching the low height threshold at which grasps are triggered. This, combined with a larger time horizon allowing the agent to recover from failed grasp attempts, explains the jump in performance when continuing training using the full action set (sparse, warm-start).

Model              | Simulation                                                                      | Real Robot
                   | Single Object | Clutter     | Table clearing (5)      | Table clearing (10)     | Single Object | Clutter
                   | success (%)   | success (%) | success (%) | % cleared | success (%) | % cleared | success (%)   | success (%)
Shaped             | 74            | 80          | 54          | 62        | 64          | 59        | -             | -
Shaped, curriculum | 94            | 98          | 94          | 97        | 87          | 91        | 85            | 78
Sparse, curriculum | 96            | 91          | 86          | 90        | 77          | 81        | 75            | 58
Sparse, BC         | 54            | 56          | 38          | 35        | 35          | 22        | -             | -
Sparse, warm-start | 90            | 98          | 77          | 86        | 76          | 78        | 90            | 70
TABLE II: Results of the simulated and real robot experiments.

We evaluate the final performance of the agents over three different tasks: picking a singulated object, picking any object out of a pile of five objects and, similarly to the experimental setup in [43], sequentially clearing objects from a flat surface until either all objects have been picked or the agent has failed twice in a row. Success rates are averaged over 200 episodes for the first two tasks or 40 sequences for the table clearing task, using the best performing agent of each model. For the latter, we additionally report the percentage of cleared objects. Also, in order to investigate whether our model generalizes to a larger number of objects than seen during training, we perform the table clearing task with an initial number of five and ten objects. Comparisons are performed using the exact same sequence of object configurations and results are reported in Table II. We can see that our curriculum formulation reaches even slightly higher success rates than the warm-start model. Generally, the latter performed well at properly aligning with objects. However, the policies learned from scratch produced an interesting behavior, namely lifting the gripper after failed grasp attempts and when no object is within the current view, increasing the chances of a successful grasp later in the episode. In contrast, the warm-started policy showed a strong bias to move the gripper downwards, following the heuristic used to collect data for pre-training. Pure BC led to poorer performance, mainly due to the agent failing to close the hand once it is aligned with the object. Fine-tuning this policy with an increased standard deviation for exploration quickly remedied this flaw. The combination of curriculum with the shaped reward function was found to be the most effective.
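The table clearing protocol and its two metrics can be sketched as follows. A sequence ends when all objects are picked or after two consecutive failed attempts, and the summary reports the percentage of fully cleared sequences together with the average percentage of cleared objects. The function names and the example outcome lists are made up for illustration, and averaging the cleared fraction per sequence is an assumption about how the second metric is computed.

```python
def run_table_clearing(attempt_outcomes, num_objects):
    """Replay pick attempts until all objects are cleared or two attempts fail in a row.

    Returns (sequence succeeded, fraction of objects cleared)."""
    cleared = 0
    consecutive_failures = 0
    for picked in attempt_outcomes:
        if picked:
            cleared += 1
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        if cleared == num_objects or consecutive_failures == 2:
            break
    return cleared == num_objects, cleared / num_objects


def summarize(sequences, num_objects):
    """Percentage of fully cleared sequences and mean percentage of cleared objects."""
    results = [run_table_clearing(seq, num_objects) for seq in sequences]
    success_rate = 100.0 * sum(1 for ok, _ in results if ok) / len(results)
    cleared_rate = 100.0 * sum(frac for _, frac in results) / len(results)
    return success_rate, cleared_rate


# Two hypothetical sequences with five objects each.
success_rate, cleared_rate = summarize(
    [[True, False, True, True, True, True],
     [True, False, False, True]],
    num_objects=5)
```

In this example the first sequence clears the table despite one failed attempt, while the second stops after two consecutive failures with one of five objects removed.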

Even though generally performance degrades, it is encouraging to see that the policies were able to cope with the larger number of objects present in the second table clearing experiment. The policies were also found to perform well over a range of different initial heights. All policies are closed-loop and react to changes in the object configuration and external perturbations. We refer to the accompanying video for an example of this behavior.

IV-B2 Ablation study of the curriculum parameters

In order to analyze the importance of the individual parameters in the curriculum, we performed multiple experiments using both reward formulations, each time keeping one of the parameters fixed at its maximum value reported in Table I. Similarly to the previous model comparison, we performed 5 runs with different seeds for each setting and report the median learning curves in Figure 5. Fixing the maximum number of objects led to very similar, if not slightly improved, results compared to the full setting. This is not surprising, as more objects in the workspace increase the chances of meaningful interaction. Fixing the target lift distance makes all runs fail in the binary reward case, as the probability of sequences that lead to final states becomes very small. In the presence of intermediate rewards for lifting the object, we observe that training still converges, but to lower success rates. Performing the entire training with a large initial robot height results in slower convergence, but still leads to similar success rates in the case of the shaped reward function. Finally, having a large workspace extent has a surprisingly small effect on exploration, but leads to lower success rates in the long run.

Fig. 5: Ablation study of the workspace curriculum parameters. For each experiment, one parameter of the curriculum was kept fixed at its maximum value. Subfigures (a) and (b) show the learning curves using the sparse and shaped reward formulations respectively. Subfigure (c) shows the learning curve and the history of the curriculum step vs training steps for one run.

IV-C Real-world Experiments

To evaluate the transfer from simulation to the real system, we perform real robot experiments on a set of 10 unseen objects, shown in Figure 1. Experiments were conducted using the best run of the shaped and sparse curriculum models and of the sparse, warm-start model, as they showed good performance in simulation. Figure 6 shows two sequences of the policy executed on a real robot. We use the same singulated object and clutter picking experiments described in the previous section, with the addition of counting any action that leads the robot to halt, e.g. due to excessive joint torques, as a failure. Objects are randomly placed in front of the robot by shuffling and placing the content of a box on a table, and a total of 40 trials is performed for both tasks. The results are shown in Table II. We observe a notable drop in performance compared to the simulated experiments, which has several causes. First, high friction and approximate collision models used in simulation allowed some weak grasps, especially on the edges of objects, which fail in the real world. Second, some collisions that occurred while the gripper was interacting with the objects, especially in the cluttered scenes, led to the activation of safety mechanisms. In this regard, the sparse, warm-start policy performed better than the sparse, curriculum model, which we believe is due to the collision check of the heuristic guiding the agent downwards, leading to zero reward. Lastly, some runs failed because of the agent prematurely closing the fingers when approaching objects, which can be explained by the remaining differences between real and simulated images, especially the high noise around the fingers. Generally, the shaped, curriculum model performed best, showing fewer collisions and more robust gripper closing decisions.

Fig. 6: Two examples of our policy running on the real platform. The first sequence of images shows how the robot sequentially clears objects from the table. In the second example, shortly before the first grasp attempt, the object is removed from the scene. As a response, the agent lifts the robot arm searching for an object and realigns with the object as soon as it is placed back on the table before successfully terminating the episode.

IV-D Discussion and Limitations

Even though we observed worse performance on the real platform compared to simulated experiments, it is still encouraging that our policies achieved up to success rates in challenging picking tasks without any real robot data. We are also convinced that these numbers can be improved by learning more robust policies in simulation, as explored in other works, either by randomizing various parameters of the dynamics [44] and perception [32], by including an adversary that applies disturbances to the system [45, 46], or by fine-tuning on the real platform. Our translation actions result in jittery motions. We expect policies trained to predict velocity or force actions to result in smoother trajectories. The perception pipeline used in this work relies on the assumption that objects are placed on flat surfaces in order to perform the described filtering steps on the camera images. However, this assumption does not always hold, and violating it could lead to a failure of our system. Regarding the wrist-mounted camera placement, following [8], we believe that this setup helps policies generalize to different scenes, since it is mostly the relative pose between objects and the gripper that matters for choosing the next action. However, it might sometimes be beneficial to rely on a top or over-the-shoulder view that provides a better overview of the objects and the scene around them.
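As a rough illustration of the dynamics and perception randomization mentioned above, the following sketch draws a fresh set of simulation parameters for each training episode. The parameter names and ranges are assumptions chosen for the example; they are not taken from this work or from the cited references, and applying the sampled values to a particular simulator is left out.

```python
import random

# Hypothetical parameter ranges (assumed for illustration only).
RANGES = {
    "friction": (0.5, 1.2),          # lateral friction coefficient
    "mass_scale": (0.8, 1.2),        # multiplier on nominal object mass
    "image_noise_std": (0.0, 0.05),  # std. dev. of additive image noise
}


def sample_episode_params(rng, ranges=RANGES):
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


# Usage: resample at the start of every episode.
rng = random.Random(0)  # fixed seed for reproducibility
params = sample_episode_params(rng)
```

Uniform sampling is the simplest choice here; other distributions or an adversarial perturbation scheme could be swapped in without changing the surrounding training loop.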

V Conclusion

In this work, we presented a curriculum-based approach to learn reactive policies for the task of object picking and compared this method against a formulation with shaped reward and against cloning a heuristic with fewer degrees of freedom (DoF). Curriculum learning allowed us to efficiently train policies using a natural sparse reward formulation and resulted in interesting behavior. However, we also found that including prior knowledge in the form of heuristics can help enforce desired behavior in a more direct way. The learned policies achieved high success rates in simulated picking tasks, both for single objects and in clutter. We also deployed agents learned in simulation to a real robot and reported our findings.

In future work, we would like to initialize agents with policies gathered through human-generated actions in an augmented reality setting and imitation learning. Additionally, it would be interesting to investigate the benefits and drawbacks of our separated network approach compared to a single convolutional neural network policy in more detail. Finally, learning a hierarchy of policies for the different sub-tasks, e.g. reaching, grasping and lifting, might result in improved and more robust behavior.


We would like to thank Dario Mammolo for his help with the robot experiments. This work was supported in part by the Swiss National Science Foundation (SNF) through the National Centre of Competence in Research (NCCR) Digital Fabrication and the Luxembourg National Research Fund (FNR) 12571953.