In order for robots to assist us in everyday tasks, they need to be able to explore and interact with the unstructured and dynamic environments found outside of traditional assembly lines and research labs. Robust manipulation of objects is a key component of all robotic applications that require interaction with their surroundings. To perform a task, the robot needs to perceive the environment through its sensors, plan and execute the next action, and at the same time handle noisy data, external disturbances and real-world uncertainties. These challenges, together with the interdisciplinary nature of the problem, make this research area very complex. Traditional approaches usually use methods from computer vision to interpret the sensor data and then use an analytic policy to plan the next action. Manually designing policies that can cope with the complexity of the high-dimensional sensory input is difficult and often results in solutions that are highly tailored to a given problem and fragile to changes in the setup or task definition. Data-driven approaches, however, have proven to be powerful in such cases, given enough experience. Reinforcement learning (RL) is a general framework for training agents to acquire desired skills through trial and error by providing a reward for successful executions. It is able to find a complex mapping from a high-dimensional input space to the desired actions without the need to explicitly model this relationship. However, depending on the complexity of the task, large amounts of data might be required to learn the desired behaviour. In particular, the exploration phase can take a long time in the presence of large, continuous state and action spaces. To reduce the training time, the complexity of the manipulation task can be incrementally increased, allowing the learning algorithm to converge faster at each step. Simulators can be used as a less expensive and faster alternative to real-world data collection.
However, transferring policies from simulation to the real world presents numerous difficulties, such as sensor noise, approximated contact physics and inaccurate friction models, which can influence the final result on the real system.
In this work, we explore RL approaches to train agents that interact with their environment in a fully closed-loop manner in order to maximize future reward. We learn policies for the full task of reaching, grasping and lifting, which map depth images captured from a wrist-mounted camera to end effector displacements and gripper actions of a robotic arm, without relying on heuristics for the grasp decision. We explore different mechanisms to reduce training costs. First, we separate perception and control by learning a compressed image representation using the latent space of an autoencoder. Second, following the methodology of curriculum learning, we guide the training of our models by progressively increasing the workspace as the agent's performance improves, and compare this method against reward function shaping and using a heuristic to bootstrap the full problem. Finally, the entire training is performed in simulation, and we report the required adjustments and findings from transferring policies to a real platform. In summary, the contributions of this paper are:
A closed-loop end-to-end formulation for the combined task of reaching, grasping and lifting different objects.
A case study of applying curriculum learning to guide training on this challenging task, including a comparison to alternative approaches.
A presentation of findings from transferring policies learned exclusively in simulation to a real-world table clearing task.
II Related Work
The task of reaching, grasping, and lifting can be solved using a large variety of approaches. First, we present a selection of works that we consider representative of general methods for solving this problem. In the second part, we highlight a selection of RL formulations, their robotic applications, and how they are used to partially or completely solve this manipulation task.
Grasp detection considers the problem of finding grasp candidates that maximize the probability of success for a given environment and gripper configuration. Early approaches relied on geometric reasoning, often assuming knowledge of the shape and physical properties of the involved objects [2, 3]. Data-driven approaches, on the other hand, aim at learning models from labeled data that can exploit visual cues and generalize to unseen objects [4]. Lenz et al. [5] trained a deep neural network on a small set of human-labeled images to predict the success of grasps on novel images. A different approach is to exploit analytic grasp theory to generate labeled data from synthetic point clouds [6, 7, 8], while a third line of research learns models in a self-supervised manner directly from physical trials [9, 10]. Compared to the first two classes, the self-supervised approaches do not require any prior knowledge of grasp theory or human-labeled samples. While some of the mentioned works improve robustness by iteratively recomputing the best grasp configuration [10, 8], they do not consider the long-term consequences of actions required to learn more complex behaviors. To enable an agent to learn such action sequences, one can pose this task as an RL problem.
Reinforcement learning [11, 12] is a general framework that considers autonomous agents which learn to choose sequences of control decisions that maximize some long-term measure of reward. To tackle the high-dimensional, continuous problems typically found in robotic applications, early works relied on task-specific, hand-engineered policy representations [13, 14]. Combining RL with the expressive power of deep neural networks has led to some impressive results in various complex decision-making problems [15, 16]. Due to the high data requirements, popular benchmarks often focus on video games and simulated control problems [17, 18]. However, a number of works have applied RL to real-world manipulation tasks. One of the most notable is Guided Policy Search, which trains a large neural network policy in a supervised manner on samples collected with trajectory-based RL. Other works tackle individual skills, such as opening doors or lifting and stacking blocks. Our problem formulation is closest to Quillen et al., who compared different off-policy RL algorithms for bin-picking in clutter with a large set of training and unseen test objects. This work was later extended to include gripper actions and a decision variable on when to terminate an episode. In both of these approaches the training data is generated from a scripted policy. In contrast, we explore curriculum learning, in addition to a heuristic policy initialization, to make the problem tractable.
Sparse reward formulations are naturally suited for many goal-oriented manipulation tasks, but also create challenges, leading to techniques such as augmenting reinforcement signals through reward shaping [20, 21], learning from expert demonstrations [13, 24, 25] and curriculum learning. The latter proposes to guide learning by presenting training samples in a meaningful order of increasing complexity, and has been applied to supervised learning for sequence prediction and to RL for acquiring a curriculum of motor skills of an articulated figure. Akin to curriculum learning, Popov et al. sample initial states along expert trajectories. Recent related work proposed to use a generative adversarial network (GAN) to automatically generate goals of increasing difficulty, to generate start state distributions that gradually expand from a given goal state, and to train a teacher to automatically choose samples for the learner. In our work, due to the large diversity of objects, goal states are not easily available. Therefore, our curriculum schedule increases both the space from which initial states are sampled and the final lifting height at which the target reward is awarded.
As opposed to using large-scale data collection on real robots [9, 10], we perform training in simulation as a less expensive, faster and safer alternative. However, deploying policies learned in simulation on a real system requires bridging the reality gap induced by differences in sensing and dynamics, and is a very active field of research. One approach is to close the gap by making the simulation match the real system as closely as possible through system identification. Another approach is to expose the learning agent to a range of different environments through domain randomization, forcing it to learn a robust representation that generalizes to the real world [32, 33]. Lastly, models can be adapted to new domains, e.g. by using progressive networks, learning correspondences with a pairwise loss function, or using generative adversarial networks to map simulated images to realistic-looking ones. In this work, similarly to prior work, we explore directly transferring trained models to the real world with only small modifications. In contrast to those approaches, our policies also need to predict height displacements and grasp decisions, which we found to be challenging.
III-A Problem Formulation
We consider the combined task of reaching, grasping and lifting objects using a parallel-jaw gripper and a wrist-mounted depth camera. Our goal is to find a closed-loop policy through model-free RL that maps sensor measurements to end effector displacement and gripper controls. The input contains visual information captured from the depth camera and the current gripper opening width. To keep the model size of our policy small, we first learn a lower-dimensional encoding of the depth images, which is then concatenated with the gripper width. This process is described in more detail in Section III-E. The 5-dimensional, continuous action vector includes the translation and yaw rotation of the robotic hand as well as the gripper opening width. The movement of the hand is performed relative to the gripper's frame, complementing our wrist-mounted camera setup. The translation vector is clipped to a fixed maximum length per step, requiring many iterations to finish the task and allowing the agent to react to dynamic changes in the environment. The gripper width command is interpreted as a binary decision, with negative/positive values being mapped to a closed/opened hand, respectively. In the remainder of this section, we present the RL training process, the agent model, and the different approaches we explored for speeding up training, including reward shaping, curriculum learning and transfer learning.
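As an illustration, the action interpretation described above can be sketched as follows; the maximum step length constant is a hypothetical placeholder, since the actual value is not reproduced here:

```python
import numpy as np

MAX_STEP = 0.03  # hypothetical maximum translation length per step (meters)

def interpret_action(action):
    """Map a raw 5D policy output to robot commands.

    action = [dx, dy, dz, dyaw, gripper], expressed in the gripper's frame.
    """
    translation = np.asarray(action[:3], dtype=float)
    norm = np.linalg.norm(translation)
    if norm > MAX_STEP:  # clip the translation to a maximum length per step
        translation = translation / norm * MAX_STEP
    yaw = float(action[3])
    # The gripper command is a binary decision: negative -> close, positive -> open.
    gripper_open = action[4] > 0.0
    return translation, yaw, gripper_open
```

Because the displacement is bounded per step, completing an episode requires many consecutive policy queries, which is what makes the behavior closed-loop.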
III-B Reinforcement Learning
Following standard formulations, we model our RL problem as a discrete-time, finite-horizon Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, r, \rho_0, P)$, where $\mathcal{S}$ denotes the set of admissible states, $\mathcal{A}$ the set of valid actions, $r$ a real-valued reward function, $\rho_0$ the initial state distribution and $P$ the (unknown) transition probability distribution. At each time step $t$, an RL agent observes the current state $s_t$ of its environment and takes an action $a_t$ according to a parameterized policy $\pi_\theta(a_t \mid s_t)$. The execution of this action causes the system to transition to a new state $s_{t+1}$ according to the system dynamics and the agent receives a reward $r_t = r(s_t, a_t)$. Episodes are terminated after a fixed number of steps or once a defined terminal state is reached. The goal of RL is to find parameters $\theta$ that maximize the expected return $J(\theta) = \mathbb{E}_\tau\left[\sum_t \gamma^t r_t\right]$, where $\gamma \in (0, 1]$ denotes a discount factor and the expectation is computed over the distribution of all possible trajectories $\tau = (s_0, a_0, s_1, a_1, \dots)$.
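The discounted return maximized by the agent can be computed per episode as in the following sketch; the discount factor value is illustrative:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return sum_t gamma^t * r_t for one episode.

    `rewards` is the list of per-step rewards collected along a trajectory.
    """
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret
```

In practice the RL algorithm optimizes an estimate of the expectation of this quantity over many sampled trajectories.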
III-B2 Training Process
In our object picking task, at the beginning of each episode we sample from the initial state distribution by randomly selecting objects and placing them at random poses within a flat, square workspace of extent $l$. The number of objects is uniformly sampled within a fixed range for every new episode, and the robot hand is placed pointing downwards at the center of the workspace at a given initial distance between its fingertips and the surface.
We consider the outcome of an episode a success, and terminate it, if any object is lifted to a given target height within the time horizon of the episode. A natural reward function for this task is binary: 1 in case of success and 0 otherwise. Such sparse rewards are difficult to learn from, requiring significant exploration. For this reason, in order to guide training, we also consider an alternative shaped reward formulation in which the agent additionally receives intermediate reward signals for lifting objects, weighting a grasp-detection indicator and the difference in the robot's height since the last step. The grasp-detection indicator is a binary function that returns 1 if a grasp was detected and 0 otherwise. Grasp detection is achieved by checking if the fingers stalled after a closing command was issued. We also include a time penalty in both the sparse and shaped reward functions, with the latter scaled by the maximum allowed change in height per step. The penalties are chosen such that rewards are shifted to negative values, encouraging the agent to complete the task as quickly as possible.
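A minimal sketch of such a shaped reward, assuming hypothetical weights and penalty values (the paper's actual constants are not reproduced here):

```python
def shaped_reward(grasp_detected, delta_height, success,
                  w_grasp=0.1, w_lift=1.0, time_penalty=0.01,
                  terminal_reward=10.0):
    """Illustrative shaped reward with intermediate signals for grasping
    and lifting. All weights are hypothetical placeholders.
    """
    if success:
        return terminal_reward
    r = -time_penalty  # the time penalty shifts rewards to negative values
    if grasp_detected:
        # reward holding an object and moving it upwards
        r += w_grasp + w_lift * delta_height
    return r
```

With these placeholder values, every step without progress costs the agent a small penalty, so shorter episodes accumulate higher return.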
III-C Workspace Curriculum
Limited prior knowledge allows model-free RL to be applied to a large set of tasks, but also renders exploration of the interesting parts of the state space challenging. Particularly in manipulation tasks with large workspaces, the agent might waste significant training time exploring free space away from the objects. For this reason, following the formalism of Bengio et al. [1], we propose a curriculum of workspaces of increasing size to guide the training of our agents.
Consider a sequence of training distributions in which the extent of the workspace, the initial robot height, the target lift distance and the maximum number of objects each increase linearly within a defined range with a variable $\lambda \in [0, 1]$. A value of $\lambda = 0$ is mapped to the smallest possible value of each parameter and $\lambda = 1$ to its maximum value. The number of objects is rounded to the nearest integer. The curriculum step is increased each time the success rate, averaged over a window of recent episodes, reaches a certain threshold. This ensures that the agent explores the state space close to the objects of interest in the early stages of training, while allowing it to scale to large workspaces in later stages.
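The curriculum schedule can be sketched as follows; the parameter names, the success-rate threshold and the number of curriculum steps are illustrative placeholders, not the paper's actual values:

```python
def curriculum_params(lam, ranges):
    """Linearly interpolate each workspace parameter for a given lambda.

    ranges: dict mapping parameter name -> (min_value, max_value).
    The number of objects is rounded to the nearest integer.
    """
    params = {}
    for name, (lo, hi) in ranges.items():
        value = lo + lam * (hi - lo)
        if name == "num_objects":
            value = int(round(value))
        params[name] = value
    return params

def update_lambda(lam, recent_successes, threshold=0.7, num_steps=8):
    """Advance the curriculum one discrete step when the windowed success
    rate exceeds the threshold; lambda is clamped to [0, 1]."""
    rate = sum(recent_successes) / max(len(recent_successes), 1)
    if rate >= threshold and lam < 1.0:
        lam = min(1.0, lam + 1.0 / (num_steps - 1))
    return lam
```

Each time the agent masters the current difficulty, the workspace grows slightly, so early exploration stays concentrated near the objects.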
III-D Transfer Learning
For comparison, we also consider agents pre-trained on a simplified task formulation that includes a heuristic to guide training, similar to prior work. Robot arm movements are restricted to planar translation and yaw rotation, with the vertical component fixed to a constant downward movement. Furthermore, the grasp decision is replaced by a heuristic that attempts a grasp once a given height threshold is reached. The reward function for this task is binary: 1 if an object was successfully lifted to the target height and 0 otherwise. Removing two degrees of freedom (DoF) significantly decreases the complexity of the task, but also limits the behavior of the learned policies. State-action pairs collected by executing an agent trained on this task can be augmented to be compatible with the original action description. Given this data, we use behavior cloning (BC) to train a policy predicting the full action space. The weights of this policy provide a warm start for further fine-tuning through RL.
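A minimal sketch of the augmentation step that lifts heuristic rollouts into the full action space; the constant downward motion and the grasp height threshold are hypothetical placeholder values:

```python
import numpy as np

DZ_CONST = -0.01     # constant downward movement of the heuristic (illustrative)
GRASP_HEIGHT = 0.05  # height at which the heuristic closes the gripper (illustrative)

def augment_action(restricted_action, gripper_height):
    """Lift a restricted (dx, dy, dyaw) action into the full 5D action space.

    The fixed downward motion and the grasp heuristic are made explicit so
    that behavior cloning can train a policy over the full action description.
    """
    dx, dy, dyaw = restricted_action
    # negative gripper command -> close, positive -> open
    gripper = -1.0 if gripper_height <= GRASP_HEIGHT else 1.0
    return np.array([dx, dy, DZ_CONST, dyaw, gripper])
```

The resulting state-action pairs can then be fed to a standard supervised (BC) loss over the full action vector.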
III-E Agent Model
We separate the visual perception and decision-making components of our agents. In a first stage, a perception network is trained in an unsupervised manner to map image observations to a low-dimensional latent vector. This network is then kept fixed and used to train a smaller policy network to maximize the reward function described in Section III-B. Details of the different network architectures are depicted in Figure 2.
The goal of the perception network is to encode information about the shape, scale and distance of the objects in the scene into a low-dimensional latent vector. In this work, we use a simple autoencoder. The encoder consists of 3 convolutional layers followed by a fully-connected layer, using leaky ReLU non-linearities. The decoder mirrors the architecture of the encoder to reproduce a full-sized image. Using a low-dimensional bottleneck and training the parameters to minimize the L2 distance between the original and reconstructed images forces the encoder to learn a compressed representation of the input. The training set was collected by running a random policy on the simplified task described in the previous section. Since they are not relevant for our task, we filtered out the plane and the gripper fingers from the images. Figure 3 shows two examples of original, filtered, reconstructed and error images using an encoder trained on a dataset of 50,000 images for 120 epochs with the Adam optimizer. The same encoder weights were used throughout all experiments in this work.
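The policy input, the latent vector concatenated with the current gripper width, can be sketched as follows; the random projection used below merely stands in for the trained encoder, and the image and latent sizes are illustrative:

```python
import numpy as np

def build_observation(depth_image, gripper_width, encoder):
    """Concatenate the encoded depth image with the current gripper width.

    `encoder` is any function mapping an image to a 1D latent vector; here a
    fixed random projection stands in for the trained autoencoder's encoder.
    """
    latent = encoder(depth_image)
    return np.concatenate([latent, [gripper_width]])
```

Because the encoder is frozen, the policy network only ever sees this compact vector, which keeps its parameter count small.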
We use a small network, trained separately from the perception network, to map encoded observations to optimal actions. Policies are modeled as multivariate Gaussian distributions. A feed-forward neural network with two hidden layers and ReLU activations maps observations to the means of the distribution, while the log-standard deviations are parameterized by a global, trainable vector. Actions are normalized to the range $[-1, 1]$ using a $\tanh$ output non-linearity. Policy weights are optimized using Trust Region Policy Optimization (TRPO), a policy gradient method that performs stable updates by enforcing a constraint on the maximum change in policy distribution between two updates.
Collecting data in a dynamic simulation with synthetic depth images instead of on a real system has several advantages: it is faster, scales better, has lower cost, requires no supervision, allows easy automatic resetting of experiments, and provides full state information. For this reason, we performed all training in simulation. We constructed a virtual world using the Bullet physics engine and added a free-floating robot hand whose position is controlled via a force constraint, avoiding the computation of inverse kinematics. A virtual camera was placed to match the viewpoint of the real setup. Depth images were generated using a software renderer bundled with the physics engine, and filtering was performed using the segmentation masks provided by the engine.
III-G Transfer to the Real Platform
We explore transferring policies trained in simulation to the real world without any fine-tuning of the network weights. Ideally, images captured by the real camera would only need to be resized and cropped to match the dimensions of the simulated camera before being passed to the encoder. However, due to imperfect data and high noise levels, especially at the operating boundaries of the real sensor, some additional filtering was required. In particular, we noticed increasing noise and some curvature towards the image boundaries, as well as high noise around the gripper's fingers. For this reason, we applied an additional elliptic mask to filter out the borders and dilated the masks of the gripper's fingers. The support surface was detected and filtered using a RANSAC-based approach.
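The elliptic border mask can be sketched as follows; the margin parameter is an illustrative assumption rather than the value used on the real setup:

```python
import numpy as np

def elliptic_mask(height, width, margin=0.05):
    """Boolean mask that keeps an inscribed ellipse and filters out the image
    borders, mimicking the additional filtering applied to real depth images.

    `margin` shrinks the ellipse relative to the image and is illustrative.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ry, rx = cy * (1.0 - margin), cx * (1.0 - margin)
    # keep pixels inside the ellipse, drop the noisy borders
    return ((ys - cy) / ry) ** 2 + ((xs - cx) / rx) ** 2 <= 1.0
```

Pixels outside the mask would be replaced by a background value before encoding, matching the filtering applied in simulation.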
The goal of our experiments is to evaluate and compare training times and final performance of the proposed models, as well as assess their capability to react to dynamic changes and transfer to the real world.
IV-A Experimental Setup
The platform used for evaluation consists of the position-controlled 7-DoF arm of an ABB YuMi. The fingers of the stock gripper were covered with rubber for better grip and reduced reflection. A CamBoard pico flexx time-of-flight camera was attached to the wrist of the robot at a fixed tilt angle, as seen in the top right image of Figure 1. In simulation, we used a model that matches the real robot and stepped the dynamics simulation with a fixed step size that provided plausible physical behavior. Training was performed on a set of procedurally generated random objects with diverse shapes (https://sites.google.com/site/brainrobotdata/home/models). Following prior work, we split the dataset into 900 train and 100 test models, and the objects were scaled to fit into the smaller gripper used in this work. The grasping task was implemented on top of the OpenAI Gym interface, and we based our implementation of TRPO on rllab. Policy iterations were performed with fixed step and batch sizes for the simplified and full task descriptions, respectively. A curriculum of eight sets of workspace parameters was used, with values linearly increasing in the ranges reported in Table I. The curriculum step is increased once a threshold success rate, averaged over the last 1000 episodes, is reached during training.
Table I: Curriculum parameters with their minimum and maximum values.
IV-B Simulated Experiments
IV-B1 Model Comparison
We analyze the learning curves and final performance of models trained on the full problem with only shaped rewards (shaped), and using the proposed curriculum with both shaped and sparse reward formulations (shaped/sparse, curriculum). We also include agents trained on the simplified task (sparse, simplified), the behavior cloning (sparse, bc) and warm-started policies (sparse, warm-start) described in Section III-D. Figure 4 shows success rates over training iterations against the number of environment interactions. For each model, we performed experiments with five different seeds and report the median as well as the worst and best run, depicted as solid lines and shaded areas, respectively. Surprisingly, even when training on the full workspace from the start, the algorithm manages to reinforce the occasional intermediate rewards provided when agents interact with objects. However, results varied strongly over the seeds, with three out of the five runs failing altogether. Using a curriculum significantly speeds up learning and improves the final performance of the agents. We observe that the difference between the learning curves for the shaped and sparse rewards is quite small. This confirms that providing easily reachable goal states at early stages of training acts as a means of guiding the agent and speeding up the training process without artificially shaping the reward function. Note that both of these models seem to stagnate temporarily partway through training. This is due to the agents repeatedly reaching the threshold success rate, triggering increases in the difficulty of the task until the curriculum parameters are set to their maximum values. This behavior is depicted in more detail in Figure 5(c), which plots the history of the curriculum step along with the history of success rates.
We also observe that replacing two DoF with a heuristic (sparse, simplified) results in a task that is considerably easier and faster to learn. However, these policies seem to converge to a lower success rate. We found that a large fraction of the failure cases are due to a collision check, implemented to avoid the agent wasting time in case the robot stalls before reaching the low height threshold at which grasps are triggered. This, combined with a longer time horizon allowing the agent to recover from failed grasp attempts, explains the jump in performance when continuing training using the full action set (sparse, warm-start).
Table II: Success rates for the single object, clutter and table clearing (5 and 10 objects) tasks, in simulation and on the real platform, including the percentage of cleared objects for the table clearing task.
We evaluate the final performance of the agents on three different tasks: picking a singulated object, picking any object out of a pile of five objects and, similarly to the experimental setup of prior work, sequentially clearing objects from a flat surface until either all objects have been picked or the agent has failed twice in a row. Success rates are averaged over 200 episodes for the first two tasks and 40 sequences for the table clearing task, using the best performing agent of each model. For the latter, we additionally report the percentage of cleared objects. Also, in order to investigate whether our model generalizes to a larger number of objects than seen during training, we perform the table clearing task with an initial count of five and of ten objects. Comparisons are performed using the exact same sequences of object configurations, and results are reported in Table II. We can see that our curriculum formulation reaches even slightly higher success rates than the warm-start model. Generally, the latter performed well at properly aligning with objects; however, the policies learned from scratch produced an interesting behavior, namely lifting the gripper after failed grasp attempts and in case no object is within the current view, increasing the chances of a successful grasp later in the episode. In contrast, the warm-started policy presented a strong bias to move the gripper downwards, following the heuristic used to collect data for pre-training. Pure BC led to poorer performance, mainly due to the agent failing to close the hand once it is aligned with the object. Fine-tuning this policy with an increased standard deviation for exploration quickly remedied this flaw. The combination of curriculum learning with the shaped reward function was found to be the most effective.
Even though performance generally degrades, it is encouraging to see that the policies were able to cope with the larger number of objects present in the second table clearing experiment. The policies were also found to perform well over a range of different initial heights. All policies are closed-loop and react to changes in the object configuration and to external perturbations. We refer to the accompanying video for an example of this behavior.
IV-B2 Ablation Study of the Curriculum Parameters
In order to analyze the importance of the individual parameters in the curriculum, we performed multiple experiments using both reward formulations, each time keeping one of the parameters fixed at its maximum value reported in Table I. As in the previous model comparison, we performed 5 runs with different seeds for each setting and report the median learning curves in Figure 5. Fixing the maximum number of objects led to very similar, if not slightly improved, results compared to the full setting. This is not surprising, as more objects in the workspace increase the chances of meaningful interaction. Fixing the target lift distance makes all runs fail in the binary reward case, as the probability of sequences that lead to successful final states becomes very small. In the presence of intermediate rewards for lifting the object, we observe that training still converges, but to lower success rates. Performing the entire training with a large initial robot height results in slower convergence, but still reaches similar success rates in the case of the shaped reward function. Finally, a large workspace extent has a surprisingly small effect on exploration, but leads to lower success rates in the long run.
IV-C Real-world Experiments
To evaluate the transfer from simulation to the real system, we perform real robot experiments on a set of 10 unseen objects, shown in Figure 1. Experiments were conducted using the best runs of the shaped and sparse curriculum models and the sparse, warm-start model, as they showed good performance in simulation. Figure 6 shows two sequences of the policy executed on the real robot. We use the same singulated object and clutter picking experiments described in the previous section, with the addition that any action that leads the robot to halt, e.g. due to excessive joint torques, is considered a failure. Objects are randomly placed in front of the robot by shuffling and emptying the content of a box onto the table, and a total of 40 trials is performed for each task. The results are shown in Table II. We observe a notable drop in performance compared to the simulated experiments, for several reasons. First, the high friction and approximate collision models used in simulation allowed some weak grasps, especially on the edges of objects, which fail in the real world. Second, some collisions that occurred while the gripper was interacting with objects, especially in cluttered scenes, led to the activation of safety mechanisms. In this regard, the sparse, warm-start policy performed better than the sparse, curriculum model, which we believe is due to the collision check of the heuristic guiding the agent downwards, where collisions led to zero reward. Lastly, some runs failed because the agent prematurely closed the fingers when approaching objects, which can be explained by the remaining differences between real and simulated images, especially the high noise around the fingers. Overall, the shaped, curriculum model performed best, showing fewer collisions and more robust gripper closing decisions.
IV-D Discussion and Limitations
Even though we observed worse performance on the real platform than in the simulated experiments, it is encouraging that our policies achieved considerable success rates in challenging picking tasks without any real robot data. We are also convinced that these numbers can be improved by learning more robust policies in simulation, as explored in other works, either through randomizing various parameters of the dynamics and perception, including an adversary applying disturbances to the system [45, 46], or by fine-tuning on the real platform. Our translation actions result in jittery motions. We expect policies trained to predict velocity or force actions to produce smoother trajectories. The perception pipeline used in this work relies on the assumption that objects are placed on flat surfaces in order to perform the described filtering steps on the camera images. However, this is not always the case and could lead to a failure of our system. Regarding the wrist-mounted camera placement, we believe that this setup helps generalize policies to different scenes, since mostly the relative pose between the objects and the gripper is of interest for choosing the next action. However, it might sometimes be beneficial to rely on a top or over-the-shoulder view providing a better overview of the objects and the scene around them.
In this work, we presented a curriculum-based approach to learn reactive policies for the task of object picking, and compared this method against a formulation with a shaped reward and against cloning a heuristic with fewer DoF. Curriculum learning allowed us to efficiently train policies using a natural sparse reward formulation and resulted in interesting behavior. However, we also found that including prior knowledge in the form of heuristics can help enforce desired behavior in a more direct way. The learned policies achieved high success rates in simulated picking tasks, both for single objects and in clutter. We also deployed agents learned in simulation on a real robot and reported our findings.
In future work, we would like to initialize agents with policies obtained through human-generated actions in an augmented reality setting and imitation learning. Additionally, it would be interesting to investigate the benefits and drawbacks of our separated network approach compared to a single convolutional neural network policy in more detail. Finally, learning a hierarchy of policies for the different sub-tasks, e.g. reaching, grasping and lifting, might result in improved and more robust behavior.
We would like to thank Dario Mammolo for his help with the robot experiments. This work was supported in part by the Swiss National Science Foundation (SNF) through the National Centre of Competence in Research (NCCR) Digital Fabrication and the Luxembourg National Research Fund (FNR) 12571953.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,”
Proceedings of the 26th annual international conference on machine learning. ACM, 2009, pp. 41–48.
-  V.-D. Nguyen, “Constructing force-closure grasps,” The International Journal of Robotics Research, vol. 7, no. 3, pp. 3–16, 1988.
-  K. B. Shimoga, “Robot grasp synthesis algorithms: A survey,” The International Journal of Robotics Research, vol. 15, no. 3, pp. 230–266, 1996.
-  J. Bohg, A. Morales, T. Asfour, and D. Kragic, “Data-driven grasp synthesis—a survey,” IEEE Transactions on Robotics, vol. 30, no. 2, pp. 289–309, 2014.
-  I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
-  M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016, pp. 598–605.
-  J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in Robotics: Science and Systems (RSS), 2017.
-  U. Viereck, A. ten Pas, K. Saenko, and R. Platt, “Learning a visuomotor controller for real world robotic grasping using simulated depth images,” in Conference on Robot Learning, 2017, pp. 291–300.
-  L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 3406–3413.
-  S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in International Symposium on Experimental Robotics. Springer, 2016, pp. 173–184.
-  L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
-  R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135.
-  J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural networks, vol. 21, no. 4, pp. 682–697, 2008.
-  F. Stulp, E. A. Theodorou, and S. Schaal, “Reinforcement learning with sequences of motion primitives for robust manipulation,” IEEE Transactions on robotics, vol. 28, no. 6, pp. 1360–1370, 2012.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
-  Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 3389–3396.
-  I. Popov, N. Heess, T. Lillicrap, R. Hafner, G. Barth-Maron, M. Vecerik, T. Lampe, Y. Tassa, T. Erez, and M. Riedmiller, “Data-efficient deep reinforcement learning for dexterous manipulation,” arXiv preprint arXiv:1704.03073, 2017.
-  D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, “Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods,” CoRR, vol. abs/1802.10264, 2018.
-  D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation,” ArXiv e-prints, Jun. 2018.
-  A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” in Proceedings of Robotics: Science and Systems (RSS), 2018.
-  M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, “Reinforced imitation: Sample efficient deep reinforcement learning for map-less navigation by leveraging prior demonstrations,” arXiv preprint arXiv:1805.07095, 2018.
-  S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
-  A. Karpathy and M. Van De Panne, “Curriculum learning for motor skills,” in Canadian Conference on Artificial Intelligence. Springer, 2012, pp. 325–330.
-  D. Held, X. Geng, C. Florensa, and P. Abbeel, “Automatic goal generation for reinforcement learning agents,” arXiv preprint arXiv:1705.06366, 2017.
-  C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel, “Reverse curriculum generation for reinforcement learning,” in Conference on Robot Learning, 2017, pp. 482–495.
-  T. Matiisen, A. Oliver, T. Cohen, and J. Schulman, “Teacher-student curriculum learning,” arXiv preprint arXiv:1707.00183, 2017.
-  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018.
-  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 23–30.
-  S. James, A. J. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” in Conference on Robot Learning, 2017, pp. 334–343.
-  A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell, “Sim-to-real robot learning from pixels with progressive nets,” in Conference on Robot Learning, 2017, pp. 262–270.
-  E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell, “Adapting deep visuomotor representations with weak pairwise constraints,” CoRR, vol. abs/1511.07111, 2015.
-  K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” CoRR, vol. abs/1709.07857, 2017.
-  E. Johns, S. Leutenegger, and A. J. Davison, “Deep learning a grasp function for grasping under gripper pose uncertainty,” in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on. IEEE, 2016, pp. 4461–4468.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
-  E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016–2018.
-  M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
-  J. Mahler and K. Goldberg, “Learning deep policies for robot bin picking by simulating robust grasping sequences,” in Conference on Robot Learning, 2017, pp. 515–524.
-  X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” CoRR, vol. abs/1710.06537, 2017.
-  L. Pinto, J. Davidson, and A. Gupta, “Supervision via competition: Robot adversaries for learning tasks,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1601–1608.
-  L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in ICML, 2017.