"Good Robot!": Efficient Reinforcement Learning for Multi-Step Visual Tasks via Reward Shaping

by Andrew Hundt, et al.

In order to learn effectively, robots must be able to extract the intangible context by which task progress and mistakes are defined. In the domain of reinforcement learning, much of this information is provided by the reward function. Hence, reward shaping is a necessary part of how we can achieve state-of-the-art results on complex, multi-step tasks. However, comparatively little work has examined how reward shaping should be done so that it captures task context, particularly in scenarios where the task is long-horizon and failure is highly consequential. Our Schedule for Positive Task (SPOT) reward trains our Efficient Visual Task (EVT) model to solve problems that require an understanding of both task context and workspace constraints of multi-step block arrangement tasks. In simulation, EVT can completely clear adversarial arrangements of objects by pushing and grasping in 99% of trials, compared to 84% for the baseline in prior work. For random arrangements, EVT clears 100% of trials at 86% action efficiency. EVT with SPOT is also able to demonstrate context understanding and complete stacks in 74% of trials, compared to a baseline of 5%. To our knowledge, this is the first instance of a Reinforcement Learning based algorithm successfully completing such a challenge. Code is available at https://github.com/jhu-lcsr/good_robot .






I Introduction

Multi-step tasks pose a significant challenge for robots, both because of their complexity and because it can be very easy to undo progress. Historically, this challenge required the use of a task planner, which represents the constraints of a task and searches for an efficient solution. Similarly, learning multi-step tasks requires an agent to understand each action within the context of a larger goal, but with limited knowledge of the state space. Whereas a motion planner constrains its search space according to a priori knowledge about the task space, an agent must learn whether each action contributes toward the end goal or reverses previous progress. We explore mechanisms for efficiently propagating this context information via Deep Reinforcement Learning (DRL) within the domain of block arrangement tasks.

We use the term context to denote information necessary to advance progress in a block stacking task (Fig. 1). For example, a learning algorithm has to discover that grasping a block from an existing stack is unproductive, and imprecisely grasping or placing a block entails a risk of toppling the stack. Likewise, an agent must learn about workspace constraints and develop and apply complex manipulation skills such as pushing or reorienting blocks (Fig. 2).

Fig. 1: Robot-created stacks and rows of cubes (left, bottom) plus an adversarial push and grasp scenario (top). Our Schedule for Positive Task Reward (SPOT) and Efficient Visual Task (EVT) network model allow us to efficiently find policies which can complete multi-step tasks. Video overview: https://youtu.be/p2iTSEJ-f_A

Discovering and effectively applying contextual knowledge is nontrivial, so behavior which demonstrates an understanding of context should be rewarded. Our proposed SPOT reward schedule (Sec. III-B) takes inspiration from a humane and effective approach to training pets sometimes called “Positive Conditioning”. Consider the goal of training a dog “Spot” to ignore an object or event she finds particularly interesting on command. In this practice, Spot is rewarded with treats whenever partial compliance with the desired end behavior is shown, and simply removed from harmful or regressive situations with zero treats (reward). One way to achieve this is to start with multiple treats in hand, place one treat in view of Spot, and if she eagerly jumps at the treat (a negative action) the human snatches and hides the treat immediately for zero reward on that action. With repetition, Spot will eventually hesitate, and so she is immediately praised with “Good Spot!” and gets a treat separate from the one she should ignore. This approach can be expanded to new situations and behaviors, plus it encourages exploration and rapid improvement once an initial partial success is achieved. As we describe in more detail below, our SPOT reward is likewise designed to provide neither reward nor punishment for actions which reverse progress.

Fig. 2: Temporal and workspace dependencies when stacking four blocks. Events at the current time step can influence the likelihood of successful outcomes for both past and future actions. A successful choice of action at any given time step will ensure both past and future actions are productive contributors to the larger task at hand, while failures indicate either a lack of progress or regression to an earlier stage.

No reward will be effective if the agent is unable to learn the task at hand in a reasonable amount of time. Thus, learning system design (in our case neural network design) goes hand in hand with reward design. In this work, we introduce a novel Efficient Visual Task (EVT) deep convolutional architecture for perception-based manipulation tasks. Furthermore, we demonstrate a novel Schedule for Positive Task (SPOT) reward enabling new capabilities for multi-step robotic tasks with Deep Reinforcement Learning.

Fig. 3: Our Efficient Visual Task (EVT) architecture. Images are pre-rotated before being passed to the network so that every coordinate in the output pixel-wise Q-Values corresponds to a final gripper position, orientation, and open loop action. Purple circles highlight the highest likelihood action, with an arrow to the corresponding height map coordinate, and point out how these values are transformed to a gripper pose. The rotated overhead views overlay the Q value at each pixel, from dark blue for values near 0 to red for high probabilities. Green arrows identify the same object across two oriented views. When EVT can see all objects, it learns to prioritize lone blocks for grasp actions and, when placing, to choose locations on top of stacked blocks. Here EVT chooses to grasp the green block and place it on the red + blue stack.

In summary, our contributions in this article are: (1) EVT, an efficient and accurate network model for visual tasks; (2) Reinforcement Learning of Multi-step robotic tasks combining model-based low level control, model-free learning of high level goals, and a progressive reward schedule; and (3) The Schedule for Positive Task (SPOT) reward for long-horizon robotic tasks. Combining the above enables DRL to train a network on “hard” multi-step tasks in a reasonable number of iterations.

Our grasping and placing Action Efficiency Error (Sec. IV-A) is 47% of the error found in previous work [23], while utilizing 49% of the computational resources. More importantly, we are able to complete long term multi-step tasks which are, to our knowledge, not viable with existing baseline algorithms.

II Related Work

The advent of deep learning has introduced novel approaches to robotic manipulation tasks like pushing and grasping. Within this space, our work confronts fundamental problems that arise in multi-step tasks like stacking or arranging blocks. We review these areas here in brief.

Deep Neural Networks (DNNs) in particular have enabled the use of raw images in robotic manipulation. In some approaches, the DNN’s output directly corresponds to motor commands, e.g. [14, 15]. Higher level methods, on the other hand, assume a simple model for robotic control and focus on bounding box or pose detection for downstream grasp planning [17, 25, 20, 5, 11, 10, 12, 16]. Increasingly, these methods benefit from the depth information provided by RGB-D sensors [23, 20, 16], which capture physical information about the workspace. However, the agent must still develop physical intuition, which recent work attempts in a more targeted setting. For instance, [13, 8] focus on block stacking by classifying simulated stacks as stable or likely to fall. Of these, the ShapeStacks dataset [8] includes a larger variety of objects such as cylinders and spheres, as opposed to blocks alone. Similarly, [7, 3] develop physical intuition by predicting push action outcomes. Our work diverges from these approaches by developing visual understanding and physical intuition simultaneously, in concert with understanding progress in multi-step tasks.

When paired with DNNs, reinforcement learning has proven effective at increasingly complex tasks in robotic manipulation. An early approach in this space [14] directly coordinates RGB vision with servo motor control, learning tasks like unscrewing a bottle or using a hammer. Other methods focus on transferring visuo-motor skills from simulated to real robots [24, 26]. Our work directs a low-level controller to perform actions rather than regressing torque vectors directly, following [23, 22] by learning a pixel-wise success likelihood map.

Notably, VPG [23] is state of the art for RL-based table clearing tasks which can be trained within hours on a single robot from images. It is frequently able to complete adversarial scenarios like those in Fig. 5 by first pushing a tightly packed group and then grasping the now-separated objects. We utilize the VPG V-REP simulation and models as our baseline for comparison. VPG assumes an instantaneous reward delivery is sufficient to complete the task at hand. By contrast, we tackle multi-step tasks with sparse rewards which cannot be represented by VPG.

Multi-step tasks with sparse rewards present a challenge in reinforcement learning generally because they are less likely to be discovered through random exploration. This suggests demonstration is an effective method for guiding exploration [21, 2]. Within robotic manipulation, one approach separates a multi-step task into many modular sub-tasks comprising a sketch [1], while another separates the learning architecture into robot- and task-specific modules [4]. Our SPOT Reward, meanwhile, combines a novel reward schedule with a frequent reset policy to make early successes both more likely as well as more instructive.

Finally, neural architecture search forms the basis for our hyperparameter choices [19, 6]. Neural networks are imperfect arbitrary function approximators, so a better choice of algorithm is an effective approach to improving deep learning based robotic manipulation algorithms, as we have detailed in past work [9].

III Approach

We formulate the problem of visual picking, pushing, and placing to complete a structure as a Partially Observable Markov Decision Process (POMDP) $(\mathcal{S}, \mathcal{O}, \mathcal{A}, T, R)$, with state space $\mathcal{S}$, observation space $\mathcal{O}$, action space $\mathcal{A}$, transition $T$, and reward function $R$. At every time step $t$, the robot chooses an action $a_t \in \mathcal{A}$ to take according to its policy $\pi$, which results in a state transition to $s_{t+1}$.

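As in VPG [23], our goal is to learn a deterministic policy $\pi$ via Q-learning, which chooses the action $a_t$ at every $t$ such that $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$. In our case, as in past work, we make the simplifying assumption that state can be identified from a single observation, in which case we can instead frame this as an MDP over observations $o_t \in \mathcal{O}$. Formally, with discount factor $\gamma$, we learn $Q$ by iteratively minimizing the temporal difference error of $Q(o_t, a_t)$ to a target $y_t$, where:

$$y_t = R(o_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q(o_{t+1}, a')$$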
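As a rough illustration of the Q-learning update described above, the sketch below computes a one-step target and the temporal difference error for plain lists of Q-values. The function names, the default discount `gamma=0.5`, and the toy numbers are assumptions made for illustration; in the paper, Q is a pixel-wise network output and the exact target may differ from this standard form.

```python
from typing import Sequence


def td_target(reward: float,
              next_q_values: Sequence[float],
              gamma: float = 0.5,
              terminal: bool = False) -> float:
    """Standard one-step Q-learning target: r + gamma * max_a' Q(o', a').

    `gamma` is an assumed discount factor, not a value taken from the paper.
    """
    if terminal or not next_q_values:
        # No future value is bootstrapped after the final step of a trial.
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value: float, target: float) -> float:
    """Absolute temporal difference error between a prediction and its target."""
    return abs(q_value - target)


if __name__ == "__main__":
    # Toy example: reward of 1.0 and two candidate next-step Q-values.
    y = td_target(1.0, [0.2, 0.6], gamma=0.5)
    print(y, td_error(0.8, y))
```

Minimizing this error over many transitions is what pulls the predicted value of each action toward the reward it earned plus the discounted value of the best follow-up action.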

In our case, each observation $o_t$ consists of RGB-D heightmap images, as shown in Fig. 3. We capture RGB-D images from a fixed camera, which are first converted into a point cloud and then projected so that the $z$ axis is aligned with the direction of gravity. As per [23], our height maps cover a $0.448^2$ m table space. Images have a resolution of $224 \times 224$, meaning that each pixel represents roughly $4$ mm$^2$.
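The relationship between heightmap pixels and table coordinates can be sketched as follows. This is a minimal example that assumes a square 0.448 m workspace sampled at 224 x 224 pixels (the VPG convention) and a workspace origin at the heightmap corner; the helper names are hypothetical.

```python
from typing import Tuple


def pixel_size_m(workspace_side_m: float = 0.448, image_side_px: int = 224) -> float:
    """Side length of one heightmap pixel in meters."""
    return workspace_side_m / image_side_px


def pixel_to_workspace(row: int, col: int,
                       origin_xy: Tuple[float, float] = (0.0, 0.0),
                       size_m: float = 0.002) -> Tuple[float, float]:
    """Map a pixel index to the (x, y) position of that pixel's center.

    The origin and axis orientation are assumptions for illustration.
    """
    x = origin_xy[0] + (col + 0.5) * size_m
    y = origin_xy[1] + (row + 0.5) * size_m
    return x, y


if __name__ == "__main__":
    side = pixel_size_m()
    print(f"pixel side: {side * 1000:.1f} mm, area: {(side * 1000) ** 2:.1f} mm^2")
    print(pixel_to_workspace(0, 0, size_m=side))
```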

Our actions are represented as a 4-tuple $a = (\phi, x, y, \theta)$. We will make two simplifying assumptions in order to handle pick and place tasks for assembly. First, we divide actions into a set of high-level motion primitives $\Phi = \{\text{grasp}, \text{push}, \text{place}\}$, where $\phi \in \Phi$. $x$ and $y$ represent 2D planar coordinates that parameterize this action primitive, and $\theta$ is an angle at which to position the gripper. Each $\phi$ also defines the appropriate gripper behavior. In practice we use discrete values for $\theta$ by passing multiple discrete rotations of the input heightmap into $Q$, again following VPG [23].
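To make the action parameterization concrete, the following sketch picks the highest-valued entry across pixel-wise Q maps and converts it to a (primitive, x, y, angle) tuple. It assumes one Q map per primitive and per heightmap rotation, stored as nested lists, with rotations evenly spaced over 360 degrees; the data layout and the function name `select_action` are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

# primitive name -> [rotation index][row][col] -> Q-value
QMaps = Dict[str, List[List[List[float]]]]


def select_action(q_maps: QMaps) -> Tuple[str, int, int, float]:
    """Return (primitive, x, y, theta_degrees) for the largest Q-value."""
    best = None
    best_q = -math.inf
    for primitive, rotations in q_maps.items():
        num_rotations = len(rotations)
        for k, grid in enumerate(rotations):
            for row, values in enumerate(grid):
                for col, q in enumerate(values):
                    if q > best_q:
                        best_q = q
                        # Rotation index k maps to an evenly spaced gripper angle.
                        theta = 360.0 * k / num_rotations
                        best = (primitive, col, row, theta)
    if best is None:
        raise ValueError("q_maps contains no Q-values")
    return best


if __name__ == "__main__":
    toy_maps = {
        "grasp": [[[0.1, 0.2], [0.3, 0.4]], [[0.0, 0.9], [0.1, 0.1]]],
        "push": [[[0.5, 0.5], [0.5, 0.5]], [[0.2, 0.2], [0.2, 0.2]]],
    }
    print(select_action(toy_maps))
```

Because the angle is encoded by which rotated copy of the heightmap produced the Q map, a single greedy argmax recovers the primitive, the planar position, and the gripper orientation together.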

III-A Network Architecture

Fig. 3 shows our overall Efficient Visual Task (EVT) solution, including choice of actions, how the robot acts on the world, and the model architecture. EVT utilizes 2 EfficientNet-B0 models, one for grasping and placing, with weights shared between color and depth data. We modify the EfficientNet-B0 [19] pretrained on ImageNet to an FCN [18] by loading the final stride 2 convolution as a stride 1 convolution with a dilation rate of 2. This dilated convolution ensures more fine grained action choices are possible by doubling the effective output resolution of the network.
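The effect of swapping a stride 2 convolution for a stride 1 convolution with dilation 2 can be checked with the standard convolution output-size formula. The kernel size, padding, and input size below are illustrative assumptions and not the actual EfficientNet-B0 layer parameters.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Spatial output size of a convolution along one dimension.

    Uses the usual formula:
    floor((size + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
    """
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


if __name__ == "__main__":
    feature_side = 14  # assumed input feature map side length

    # Original layer (assumed 3x3 kernel): stride 2 halves the resolution.
    strided = conv_output_size(feature_side, kernel=3, stride=2, padding=1)

    # Modified layer: stride 1 with dilation 2 keeps the resolution while
    # each output still covers a similarly wide neighborhood of the input.
    dilated = conv_output_size(feature_side, kernel=3, stride=1,
                               padding=2, dilation=2)

    print(strided, dilated)
```

Under these assumed settings the dilated variant produces twice as many output positions per side, which is the doubling of effective output resolution described above.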

The final dense grasp, push, and place blocks consist of a batchnorm and relu before each of two 1x1 convolutions with 2560 channels, i.e. [bn, relu, conv1x1, bn, relu, conv1x1], where a 1x1 convolution is equivalent to a dense layer at each pixel. These parameters are based on the final dense block structure optimized for accuracy via HyperTree Architecture Search [9] in our prior work. We note that efficiency was not considered in the HyperTree metric and as a result this pixel-wise dense block accounts for over 50% of the computation in EVT, so it is a good target for future efficiency gains. The push and grasp EVT architecture consumes 46B FLOPS and holds 24M parameters. The added dense block for placement actions brings this to 57B FLOPS and 30M parameters.

In our experiments we compare with VPG [23], which is the current state-of-the-art DRL algorithm for clearing objects from a surface. It incorporates 4 separate DenseNet-121 models for grasp RGB, grasp Depth, push RGB, and push Depth, with a total of 94B floating point add-multiply operations (FLOPS). A push, grasp, and place configuration of VPG has 141.3B FLOPS and 48M parameters, which is 147% more FLOPS than EVT.
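The efficiency comparison above is simple arithmetic, sketched here for reference. The FLOPS figures are those quoted in the text; the helper names are hypothetical.

```python
def relative_increase(new: float, base: float) -> float:
    """Fractional increase of `new` over `base`, e.g. 0.5 means 50% more."""
    return (new - base) / base


def fraction_of(part: float, whole: float) -> float:
    """Ratio of `part` to `whole`."""
    return part / whole


if __name__ == "__main__":
    # Push + grasp + place: VPG at 141.3B FLOPS versus EVT at 57B FLOPS.
    extra = relative_increase(141.3, 57.0)
    # Push + grasp only: EVT at 46B FLOPS versus VPG at 94B FLOPS.
    share = fraction_of(46.0, 94.0)
    print(f"VPG uses {extra:.1%} more FLOPS; EVT uses {share:.1%} of VPG's FLOPS")
```

The first ratio comes out just under 1.48, in line with the 147% figure quoted above, and the second is about 0.49, matching the 49% of computational resources mentioned in the introduction.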

III-B SPOT Reward

We focus on learning neural network models in a manner which can solve multi-stage tasks with strong contextual and temporal dependencies between stages. As such, we assume there is a well-defined notion of progress throughout a given task. To this end, consider a set of normalized rewards with range . We generalize the baseline fixed reward definition of VPG [23] with the Exponential Reward Schedule (ExpRS):


For example, VPG’s fixed rewards of 0.5 and 1 can be represented with , is simply a reasonable choice of exponential base, is the current action iteration in a given trial. For example, a successful push gets score when we ignore future discounted rewards. Using these parameters we also chose an ExpRS place reward of 2 for multi-step tasks, and the variables we have defined here are a starting point for our SPOT reward below.

Our Schedule for Positive Task (SPOT) reward has two components: a linearly increasing sub-task “partial compliance” reward is delivered for actions which make progress on the task and a reward of 0 is delivered for actions which result in a reversal of progress. Formally, each reward is computed from different sub-tasks with a number associated with the current active sub-task . Examples of possible values for include one of [grasp=1, push=2, place=3] at a specific action in the sequence of actions during the trial, so varies depending on the action taken at a given time , and is . A current task progress depth indicates linear progress through a task, such as stack height. The range of possible reward values is defined between [,]. First, we wish to expand the skill set, and each mastered skill provides exponential increases in overall capability. For this reason we ensure rewards grow faster as more parts of a curriculum are mastered in the Positive Reward (PR) Component:


Second, as in our story of Spot the dog (Sec. I), we also wish to minimize disincentive for exploration without rewarding mistakes. This second component is called Situation Removal (SR), and is applied separately at each action time step:


Here is from eq. 2 and subtask rewards are given if the “partial compliance” subtask is successfully completed, and 0 otherwise. For stacking tasks we choose and the actions for which subtask rewards are delivered include [none, successful scene change, grasp success, place success] with values . Task depth is the stack height with possible values of . In addition to the instantaneous version of SPOT above, we also define a recursive SPOT Trial Reward for use during experience replay of previous full trials:


These values are recursively rolled out from the final action to the first action of a trial. The effect of this trial reward is that future rewards only propagate across time steps where subtasks are completed successfully.
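The recursive roll-out described above can be sketched as a backward pass over the instantaneous rewards of one trial. This is a minimal illustration under stated assumptions: a zero instantaneous reward is treated as a failed step that receives no reward and blocks propagation, and `gamma=0.5` is an assumed discount factor. It is not the exact equation from the paper.

```python
from typing import List, Sequence


def spot_trial_rewards(instant_rewards: Sequence[float],
                       gamma: float = 0.5) -> List[float]:
    """Roll rewards back from the final action to the first action of a trial.

    A step with zero instantaneous reward gets a trial reward of zero, so
    rewards earned later in the trial never flow back past that step.
    """
    trial = [0.0] * len(instant_rewards)
    future = 0.0
    for t in range(len(instant_rewards) - 1, -1, -1):
        r = instant_rewards[t]
        if r == 0:
            trial[t] = 0.0  # failed or regressive step: nothing propagates
        else:
            trial[t] = r + gamma * future
        future = trial[t]
    return trial


if __name__ == "__main__":
    # Success, failure, success, success (toy values).
    print(spot_trial_rewards([1.0, 0.0, 1.0, 2.0], gamma=0.5))
```

In the toy trial, the two final successes reinforce each other, while the first action keeps only its own reward because the failed second action cuts it off from the later progress.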

| Task | Model | Reward Schedule | Scenarios 100% Complete | Trials Completed | Grasp | FLOPS (B) | Params (M) | Action Efficiency |
|---|---|---|---|---|---|---|---|---|
| Clear 10 Toys | VPG [23] | Exp (eq. 1) | 100% | 100% | 68% | 94 | 32 | 61% |
| Clear 10 Toys | EVT | Exp (eq. 1) | 100% | 100% | 87% | 46 | 24 | 82% |
| Clear 10 Toys | EVT | SPOT (eq. 4) | 100% | 100% | 87% | 46 | 24 | 86% |
| Clear Toys Adversarial | VPG [23] | Exp (eq. 1) | 5/11 | 84% | 77% | 94 | 32 | 60% |
| Clear Toys Adversarial | EVT | Exp (eq. 1) | 10/11 | 99% | 62% | 46 | 24 | 51% |

TABLE I: Pushing and grasping results. The first task is to clear 10 toys for 100 trials with random arrangements, and the second is to clear 10 trials each across 11 adversarial arrangements with 110 total trials. Bold entries highlight our key algorithm improvements over the baseline.

| Task | Model | Reward Schedule | Tasks Completed | Grasp | Place | FLOPS (B) | Params (M) | Action Efficiency |
|---|---|---|---|---|---|---|---|---|
| Stack of 4 Cubes | EVT | Exp (eq. 1) | 5% | 84% | 45% | 57 | 30 | 4% |
| Stack of 4 Cubes | EVT | SPOT (eq. 3) | 74% | 93% | 83% | 57 | 30 | 63% |
| Row of 4 Cubes | EVT | SPOT (eq. 4) | 92% | 68% | 61% | 57 | 30 | 44% |

TABLE II: Multi-Step task test success rates measured out of 100% for simulated tasks involving push, grasp and place actions. A lower Floating Point Operations (FLOPS) count indicates improved neural network computational efficiency. Bold entries highlight our key algorithm improvements over the baseline.

IV Experiments

We conducted simulated experiments in the two baseline scenarios provided by VPG [23] with the same simulator and settings, as well as in two multi-step tasks of our own design. We present the results of these experiments in Tables I and II, with descriptions and analysis below.

IV-A Evaluation Metrics

We evaluate our algorithms in test cases with new random seeds and in accordance with the metrics found in VPG [23]. These include the percentage of successful grasps, placement action efficiency, and task completion rate. The completion rate is defined as the percentage of trials where the policy is able to successfully complete a task before the grasp or push action fails 10 consecutive times. A push is successful when more than 300 pixels have changed in a scene. A successful grasp is counted when the gripper is in a partially open state after executing the open loop action, indicating an object is present in the gripper; a fully closed gripper indicates grasp failure. A successful place for stacking is evaluated more specifically by the z height of object origins when object poses are known. Alternatively, it can be awarded when the highest vertical z height of a scene has increased by a minimum threshold. Ideal Action Efficiency is 100%, and Action Efficiency is calculated as the ratio of the Ideal Action Count to the number of actions actually taken. The Ideal Action Count is 1 action per object for grasping tasks, and 2 actions per object for tasks which involve placement, where one object is assumed to remain stationary. This means 6 total actions for a stack of height 4 since only 3 objects must move.
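The action efficiency metric can be sketched as below, assuming it is the ratio of the Ideal Action Count to the number of actions actually taken. The function names are hypothetical, and the example action counts are made up for illustration.

```python
def ideal_action_count(num_objects: int, involves_placement: bool) -> int:
    """Ideal number of actions for a task.

    Grasp-only tasks need one action per object. Placement tasks need a
    grasp and a place for every object except one that stays stationary.
    """
    if involves_placement:
        return 2 * (num_objects - 1)
    return num_objects


def action_efficiency(ideal_actions: int, actual_actions: int) -> float:
    """Ratio of ideal to actual action count (1.0 corresponds to 100%)."""
    if actual_actions <= 0:
        raise ValueError("actual_actions must be positive")
    return ideal_actions / actual_actions


if __name__ == "__main__":
    stack_ideal = ideal_action_count(4, involves_placement=True)
    print(stack_ideal)                          # ideal actions for a 4-block stack
    print(action_efficiency(stack_ideal, 12))   # hypothetical trial using 12 actions
```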

IV-B Baseline Scenarios

Clear 10 Toys: We establish a baseline via the primary simulated experiment found in VPG [23], where 10 toys with varied shapes must be grasped to clear the robot workspace. EVT reduces the Action Efficiency Error (1 - Action Efficiency) from 39% with VPG to 14% with EVT. We validate on 100 trials of 10 novel random object positions.

Table I (top) shows the results. As this is a fairly straightforward task, all methods were eventually able to complete it in the basic case, but we see that our proposed EVT model has the highest grasp success rate. In this case, we can also see SPOT does not make a meaningful difference, which makes intuitive sense, as the task structure is very simple.

Fig. 4: EVT Grasp Success Rate and Action Efficiency improvement during SPOT Push + Grasp Training. Higher is better.
Fig. 5: A Q-Value overlay on height map images of a successful EVT grasp action sequence “Clear Toys Adversarial” trial, with examples ordered in time from left to right. In this case a single extra grasp action was used to separate the packed blocks without any push actions. The light purple line indicates the gripper position and orientation. Video: https://youtu.be/F85d9xGCDnY

Clear Toys Adversarial: Our second, more challenging baseline scenario (Fig. 5) contains 11 adversarial arrangements from prior work [23] where toys are placed in tightly packed configurations. We use the pretrained weights from the “Clear 10 Toys” task in scenarios the algorithm has never previously seen. The evaluation algorithm defined by VPG [23] is designed such that when no successful grasp is made for 10 consecutive actions the task is considered incomplete. We validate on 10 trials for each of the 11 challenge arrangements, and our model was able to clear the scene in 109 out of 110 trials. Table I (bottom) details the results: EVT performed substantially better. In EVT’s lone failure case, the model successfully separated the tightly packed blocks on the 10th and final action, but without a successful grasp. It is reasonable to expect it would have finished clearing the scene in the next few actions had it not hit the incomplete task limit.
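The incomplete-task rule described above can be sketched as a check for a run of consecutive failures. This is an illustrative helper with a hypothetical name, assuming each action is recorded as a boolean success flag; it is not taken from the VPG or EVT code.

```python
from typing import Iterable


def exceeds_failure_limit(successes: Iterable[bool], limit: int = 10) -> bool:
    """Return True if `limit` or more failures occur in a row.

    `successes` holds one flag per action, in the order the actions ran.
    """
    consecutive = 0
    for ok in successes:
        # Any success resets the streak; a failure extends it.
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= limit:
            return True
    return False


if __name__ == "__main__":
    # Nine failures, one success, nine more failures: the limit is never hit.
    print(exceeds_failure_limit([False] * 9 + [True] + [False] * 9))
    # Ten failures in a row: the trial is marked incomplete.
    print(exceeds_failure_limit([False] * 10))
```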

Curiously, while the rate of Tasks Completed rises, Action Efficiency is reduced in these challenging scenarios. Subjectively, this is due to the higher priority placed on grasping when compared to pushing as EVT attempts grasps in 89% of actions in these scenarios on average. In many cases the algorithm attempts a push only after several failed attempts at grasping, which finally frees up the blocks to complete the task, while in other cases it can separate blocks using grasps alone.

IV-C Multi-Step Task Scenarios

We attempted to evaluate a direct extension of the VPG algorithm in which we simply add new DenseNet-121 place action models for RGB and Depth. However, this architecture exceeds the memory limits of our RTX 2080 Ti GPUs (and the Titan X GPUs from VPG), illustrating limitations in the scalability of VPG when compared to our EVT architecture for new tasks.

Stack 4 Cubes: Our primary test task is to stack 4 cubes randomly placed within the scene. We ensure that workspace constraints are strictly observed: any action which topples a partial stack the robot has already assembled returns a reward of 0 and immediately ends the trial with the failure condition. This strict progress evaluation criterion ensures the scores indicate an understanding of the context surrounding the stack. A block can also occasionally tumble out of the workspace after a missed place action, which leaves no opportunity for recovery.

EVT is evaluated under the basic exponential reward schedule (eq. 1) and our SPOT Reward given in eq. 2, which accounts for task progress. Table II shows the results. As evidenced by the large difference between SPOT and the baseline reward schedule, the SPOT Reward proves essential to task completion. EVT trained with SPOT succeeds 74% of the time, versus only 5% without it. This is because the exponential reward schedule cannot differentiate between placing one block on another single block and placing it on a stack of height 2.

Fig. 6: Learning Progress on the Stack 4 Cubes task. Failures include missed grasps, off-stack placements, and actions in which the stack topples. Higher is better.

Row of 4 Cubes: Our third test task evaluates the ability of the algorithm to generalize across tasks. Curiously, while making a row of 4 blocks appears the same as stacking, it is in fact much more difficult to train to complete efficiently. In particular, whereas with stacking optimal placement occurs on top of a strong visual feature—another block—the arrangement of blocks in rows depends on non-local visual features, i.e. the rest of the row. Additionally, every block in each row is available for grasping, which may reverse progress, as opposed to stacks where only the top block is readily available. This requires significant understanding of context, as we have described it, to accomplish. Table II shows significant progress on this challenging environment, succeeding 92% of the time. The higher overall task completion rate for rows when compared to stacks is in part due to reduced risk of a block tumbling out of the robot workspace.

V Conclusion

In spite of its obvious importance to results in reinforcement learning, reward shaping is one of the most under-explored areas in deep learning research for robotics. We have demonstrated an effective approach for training long-horizon tasks which present a high risk for reversing progress: the SPOT reward. To our knowledge, this is the first instance of reinforcement learning applied toward such a challenge. First, our EVT neural network model far exceeds existing methods’ computational efficiency for manipulation tasks while simultaneously providing a 20% increase in action efficiency and a 15% higher perfect completion rate for adversarial pushing and grasping scenarios. Our results show the continued importance of neural network architecture design choices for Robotics and Reinforcement Learning algorithms. Second, our SPOT Reward quantifies an agent’s progress within multi-step tasks while also providing zero-reward guidance, which we found necessary to achieve a 74% completion rate on a block stacking task and a 92% completion rate in a row creation task. Our work does assume some mechanism for intermediate rewards. Related methods which could help relax this assumption include inverse reinforcement learning of reward signals necessary to complete tasks, learning from demonstration, and metalearning. We expect that recent reinforcement learning algorithms beyond a Q function could also improve the efficiency of our algorithm. Nonetheless, we believe these principles are an effective approach for efficient learning on complex multi-step problems.

Finally, we note that our simulation executes in real time to ensure the viability of task transfer to a real robot as they did in the baseline VPG [23] experiments. It is our hope to demonstrate similar results on a physical testbed in the near future.


This material is based upon work supported by the NSF NRI Grant Award #1637949.


  • [1] J. Andreas, D. Klein, and S. Levine (2016) Modular Multitask Reinforcement Learning with Policy Sketches. ArXiv e-prints, arXiv:1611.01796.
  • [2] Y. Aytar, T. Pfaff, D. Budden, T. Paine, Z. Wang, and N. de Freitas (2018) Playing hard exploration games by watching youtube. In Advances in Neural Information Processing Systems, pp. 2935–2945.
  • [3] A. Byravan and D. Fox (2016) SE3-nets: learning rigid body motion using deep neural networks. arXiv preprint arXiv:1606.02378. https://arxiv.org/abs/1606.02378
  • [4] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176.
  • [5] B. Drost, M. Ulrich, N. Navab, and S. Ilic (2010) Model globally, match locally: efficient and robust 3d object recognition. In CVPR, Vol. 1, pp. 5.
  • [6] T. Elsken, J. H. Metzen, and F. Hutter (2019) Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21.
  • [7] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised Learning for Physical Interaction through Video Prediction. ArXiv e-prints, arXiv:1605.07157.
  • [8] O. Groth, F. B. Fuchs, I. Posner, and A. Vedaldi (2018) ShapeStacks: learning vision-based physical intuition for generalised object stacking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 702–717.
  • [9] A. Hundt, V. Jain, C. Lin, C. Paxton, and G. D. Hager (2019) The costar block stacking dataset: learning with workspace constraints. Intelligent Robots and Systems (IROS), 2019 IEEE International Conference on.
  • [10] E. Jang, S. Vijayanarasimhan, P. Pastor, J. Ibarz, and S. Levine (2017) End-to-end learning of semantic grasping. In Conference on Robot Learning, pp. 119–132.
  • [11] S. Kumra and C. Kanan (2017) Robotic grasp detection using deep convolutional neural networks. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). ISBN 9781538626825.
  • [12] I. Lenz, H. Lee, and A. Saxena (2015) Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34 (4-5), pp. 705–724. Dataset: http://pr.cs.cornell.edu/grasping/rect https://doi.org/10.1177/0278364914549607
  • [13] A. Lerer, S. Gross, and R. Fergus (2016) Learning physical intuition of block towers by example. International Conference on Machine Learning, pp. 430–438.
  • [14] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373.
  • [15] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Dataset: https://sites.google.com/site/brainrobotdata/home https://doi.org/10.1177/0278364917710318
  • [16] D. Morrison, J. Leitner, and P. Corke (2018) Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach. Robotics: Science and Systems XIV. ISBN 9780992374747.
  • [17] J. Redmon and A. Angelova (2014) Real-time grasp detection using convolutional neural networks. CoRR abs/1412.3128.
  • [18] E. Shelhamer, J. Long, and T. Darrell (2017) Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 640–651.
  • [19] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
  • [20] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems (RSS), Vol. 14.
  • [21] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese (2018) Neural task programming: learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8.
  • [22] A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser (2019) TossingBot: learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239.
  • [23] A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser (2018) Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238–4245.
  • [24] F. Zhang, J. Leitner, M. Milford, and P. Corke (2016) Modular deep q networks for sim-to-real transfer of visuo-motor policies. Australasian Conference on Robotics and Automation (ACRA) 2017.
  • [25] H. Zhang, X. Zhou, X. Lan, J. Li, Z. Tian, and N. Zheng (2018) A real-time robotic grasp approach with oriented anchor box. arXiv preprint arXiv:1809.03873.
  • [26] Y. Zhu, Z. Wang, J. Merel, A. A. Rusu, T. Erez, S. Cabi, S. Tunyasuvunakool, J. Kramár, R. Hadsell, N. de Freitas, and N. Heess (2018) Reinforcement and imitation learning for diverse visuomotor skills. In Robotics: Science and Systems XIV, Vol. 14.