Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching

10/03/2017 ∙ by Andy Zeng, et al. ∙ 0

This paper presents a robotic pick-and-place system that is capable of grasping and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select among four different grasping primitive behaviors. It then recognizes picked objects with a cross-domain image classification framework that matches observed images to product images. Since product images are readily available for a wide range of objects (e.g., from the web), the system works out-of-the-box for novel objects without requiring any additional training data. Exhaustive experimental results demonstrate that our multi-affordance grasping achieves high success rates for a wide variety of objects in clutter, and our recognition algorithm achieves high accuracy for both known and novel grasped objects. The approach was part of the MIT-Princeton Team system that took 1st place in the stowing task at the 2017 Amazon Robotics Challenge. All code, datasets, and pre-trained models are available online at



Code Repositories


MIT-Princeton Vision Toolbox for Robotic Pick-and-Place at the Amazon Robotics Challenge 2017 - Grasp Detection and Image Matching for Novel Objects with Deep Learning


I Introduction

A human’s remarkable ability to grasp and recognize unfamiliar objects with little prior knowledge of them is a constant inspiration for robotics research. This ability to grasp the unknown is central to many applications: from picking packages in a logistic center to bin-picking in a manufacturing plant; from unloading groceries at home to clearing debris after a disaster. The main goal of this work is to demonstrate that it is possible – and practical – for a robotic system to pick and recognize novel objects with very limited prior information about them (e.g. with only a few representative images scraped from the web).

Despite the interest of the research community, and despite its practical value, robust manipulation and recognition of novel objects in cluttered environments remains a largely unsolved problem. Classical solutions for robotic picking require recognition and pose estimation prior to model-based grasp planning, or require object segmentation to associate grasp detections with object identities. These solutions tend to fall short when dealing with novel objects in cluttered environments, since they rely on 3D object models and/or large amounts of training data to achieve robust performance. Although there has been inspiring recent work on detecting grasps directly from RGB-D pointclouds, as well as on learning-based recognition systems that handle novel objects and limited data, these methods have yet to be proven under the constraints and at the accuracy required by a real task with heavy clutter, severe occlusions, and object variability.

Fig. 1: Our picking system computing pixel-wise affordances for grasping over visual observations of bins full of objects (a), grasping a towel and holding it up away from clutter, and recognizing it by matching observed images of the towel (b) to an available representative product image. The entire system works out-of-the-box for novel objects (unseen in training) without the need for any additional data collection or re-training.

In this paper, we propose a system that picks and recognizes objects in cluttered environments. We have designed the system specifically to handle a wide range of objects novel to the system without gathering any task-specific training data for them. To make this possible, our system consists of two components. The first is a multi-affordance grasping framework which uses fully convolutional networks (FCNs) to take in visual observations of the scene and output a dense grid of values (arranged with the same size and resolution as the input data) measuring the affordance (or probability of picking success) for four different grasping primitive actions over a pixel-wise sampling of end effector orientations and locations. The primitive action with the highest inferred affordance value determines the grasping action executed by the robot. This grasping framework operates without a priori object segmentation and classification and hence is agnostic to object identity. The second component of the system is a cross-domain image matching framework for recognizing grasped objects by matching them to product images using a two-stream convolutional network (ConvNet) architecture. This framework adapts to novel objects without additional re-training. Both components work hand-in-hand to achieve robust picking performance of novel objects in heavy clutter.

We provide exhaustive experiments and ablation studies to evaluate both components. We demonstrate that our affordance-based algorithm for grasp planning achieves high success rates for a wide variety of objects in clutter, and the recognition algorithm achieves high accuracy for known and novel grasped objects. These algorithms were developed as part of the MIT-Princeton Team system that took 1st place in the stowing task of the Amazon Robotics Challenge (ARC), being the only system to have successfully stowed all known and novel objects from an unstructured tote into a storage system within the allotted time frame. Fig. 1 shows our robot in action during the competition.

In summary, our main contributions are:

  • An object-agnostic grasping framework using four primitive grasping actions for fast and robust picking, utilizing fully convolutional networks for inferring the pixel-wise affordances of each primitive (Section IV).

  • A perception framework for recognizing both known and novel objects using only product images without extra data collection or re-training (Section V).

  • A system combining these two frameworks for picking novel objects in heavy clutter.

All code, datasets, and pre-trained models are available online at [1]. We also provide a video summarizing our approach at

II Related Work

In this section, we review works related to robotic picking systems. Works specific to grasping (Section IV) and recognition (Section V) are in their respective sections.

II-A Recognition followed by Model-based Grasping

A large number of autonomous pick-and-place solutions follow a standard two-step approach: object recognition and pose estimation followed by model-based grasp planning. For example, Jonschkowski et al. [2] designed object segmentation methods over handcrafted image features to compute suction proposals for picking objects with a vacuum. More recent data-driven approaches [3, 4, 5, 6] use ConvNets to provide bounding box proposals or segmentations, followed by geometric registration to estimate object poses, which ultimately guide handcrafted picking heuristics [7, 8]. Nieuwenhuisen et al. [9] improve many aspects of this pipeline by leveraging robot mobility, while Liu et al. [10] add a pose correction stage when the object is in the gripper. These works typically require 3D models of the objects during test time, and/or training data with the physical objects themselves. This is practical for tightly constrained pick-and-place scenarios, but is not easily scalable to applications that consistently encounter novel objects, for which only limited data (i.e. product images from the web) is available.

II-B Recognition in parallel with Object-Agnostic Grasping

It is also possible to exploit local features of objects without object identity to efficiently detect grasps [11, 12, 13, 14, 15, 16, 17, 18, 19]. Since these methods are agnostic to object identity, they better adapt to novel objects and experience higher picking success rates by eliminating error propagation from a prior recognition step. Matsumoto et al. [20] apply this idea in a full picking system by using a ConvNet to compute grasp proposals, while in parallel inferring semantic segmentations for a fixed set of known objects. Although these pick-and-place systems use object-agnostic grasping methods, they still require some form of in-place object recognition in order to associate grasp proposals with object identities, which is particularly challenging when dealing with novel objects in clutter.

Fig. 2: The bin and camera setup. Our system consists of 4 units (top), where each unit has a bin with 4 stationary cameras: two overlooking the bin (bottom-left) are used for inferring grasp affordances while the other two (bottom-right) are used for recognizing grasped objects.

II-C Active Perception

Active perception – exploiting control strategies for acquiring data to improve perception [21, 22] – can facilitate the recognition of novel objects in clutter. For example, Jiang et al. [23] describe a robotic system that actively rearranges objects in the scene (by pushing) in order to improve recognition accuracy. Other works [24, 25] explore next-best-view based approaches to improve recognition, segmentation and pose estimation results. Inspired by these works, our system applies active perception by using a grasp-first-then-recognize paradigm where we leverage object-agnostic grasping to isolate each object from clutter in order to significantly improve recognition accuracy for novel objects.

III System Overview

We present a robotic pick-and-place system that grasps and recognizes both known and novel objects in cluttered environments. The “known” objects are provided to the system at training time, both as physical objects and as representative product images (images of objects available on the web); while the “novel” objects are provided only at test time in the form of representative product images.

Overall approach. The system follows a grasp-first-then-recognize work-flow. For each pick-and-place operation, it first uses FCNs to infer the pixel-wise affordances of four different grasping primitive actions: from suction to parallel-jaw grasps (Section IV). It then selects the grasping primitive action with the highest affordance, picks up one object, isolates it from the clutter, holds it up in front of cameras, recognizes its category, and places it in the appropriate bin. Although the object recognition algorithm is trained only on known objects, it is able to recognize novel objects through a learned cross-domain image matching embedding between observed images of held objects and product images (Section V).

Advantages. This system design has several advantages. First, the affordance-based grasping algorithm is model-free and agnostic to object identities and generalizes to novel objects without re-training. Second, the category recognition algorithm works without task-specific data collection or re-training for novel objects, which makes it scalable for applications in warehouse automation and service robots where the range of observed object categories is large and dynamic. Third, our grasping framework supports multiple grasping modes with a multi-functional gripper and thus handles a wide variety of objects. Finally, the entire processing pipeline requires only a few forward passes through deep networks and thus executes quickly (Table II).

Fig. 3: Multi-functional gripper with a retractable mechanism that enables quick and automatic switching between suction (pink) and grasping (blue).

System setup. Our system features a 6DOF ABB IRB 1600id robot arm next to four picking work-cells. The robot arm’s end-effector is a multi-functional gripper with two fingers for parallel-jaw grasps and a retractable suction cup (Fig. 3). This gripper was designed to function in cluttered environments: finger and suction cup length are specifically chosen such that the bulk of the gripper body does not need to enter the cluttered space. Each work-cell has a storage bin and four statically-mounted RealSense SR300 RGB-D cameras (Fig. 2): two cameras overlooking the storage bins are used to infer grasp affordances, while the other two pointing towards the robot gripper are used to recognize objects in the gripper. Although our experiments were performed with this setup, the system was designed to be flexible for picking and placing between any number of reachable work-cells and camera locations. Furthermore, all manipulation and recognition algorithms in this paper were designed to be easily adapted to other system setups.

Fig. 4: Multiple motion primitives for suction and grasping to ensure successful picking for a wide variety of objects in any orientation.

IV Multi-Affordance Grasping

The goal of the first step in our system is to robustly grasp objects from a cluttered scene without relying on their object identities or poses. To this end, we define a set of four grasping primitive actions that are complementary to each other in terms of utility across different object types and scenarios – empirically maximizing the variety of objects and orientations that can be picked with at least one primitive. Given RGB-D images of the cluttered scene at test time, we infer the dense pixel-wise affordances for all four primitives. A task planner then selects and executes the primitive with the highest affordance (more details of this planner can be found in the Appendix).
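The selection rule described above (pick the primitive and location with the highest inferred affordance) can be sketched as follows. This is a minimal illustration, not the task planner from the Appendix: the dictionary layout, the primitive names used as keys, and the function name `select_primitive` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical container: one 2D affordance map (values in [0, 1]) per primitive.
AffordanceMaps = Dict[str, List[List[float]]]


def select_primitive(maps: AffordanceMaps) -> Tuple[str, Tuple[int, int], float]:
    """Return (primitive name, (row, col), affordance) of the global maximum."""
    best = None
    for name, grid in maps.items():
        for r, row in enumerate(grid):
            for c, value in enumerate(row):
                # Keep the first strictly larger value seen so far.
                if best is None or value > best[2]:
                    best = (name, (r, c), value)
    if best is None:
        raise ValueError("no affordance values provided")
    return best


# Toy 2x2 maps with made-up values, one per primitive.
maps = {
    "suction-down": [[0.10, 0.80], [0.30, 0.20]],
    "suction-side": [[0.05, 0.40], [0.10, 0.00]],
    "grasp-down": [[0.92, 0.10], [0.20, 0.60]],
    "flush-grasp": [[0.00, 0.30], [0.70, 0.10]],
}
choice = select_primitive(maps)
```

A real planner would likely also account for reachability and previously failed attempts; the sketch only covers the argmax step stated in the text.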

Fig. 5: Learning pixel-wise affordances for suction and grasping. Given multi-view RGB-D images, we infer pixel-wise suction affordances for each image with an FCN. The inferred affordance value at each pixel describes the utility of suction at that pixel’s projected 3D location. We aggregate the inferred affordances onto a 3D point cloud, where each point corresponds to a suction proposal (down or side based on surface normals). In parallel, we merge RGB-D images into an orthographic RGB-D heightmap of the scene, rotate it by 16 different angles, and feed them each through another FCN to estimate the pixel-wise affordances of horizontal grasps for each heightmap. This effectively produces affordance maps for 16 different top-down grasping angles, from which we generate grasp down and flush grasp proposals. The suction or grasp proposal with the highest affordance value is executed.

IV-A Grasping Primitives

We define four grasping primitives to achieve robust picking for typical household objects. Fig. 4 shows example motions for each primitive. Each primitive is implemented as a set of guarded moves, with collision avoidance and quick success or failure feedback mechanisms: for suction, this comes from flow sensors; for grasping, this comes from contact detection via force feedback from sensors below the work-cell. Robot arm motion planning is automatically executed within each primitive with stable IK solves [26]. These primitives are as follows:

Suction down grasps objects with a vacuum gripper vertically. This primitive is particularly robust for objects with large and flat suctionable surfaces (e.g. boxes, books, wrapped objects), and performs well in heavy clutter.

Suction side grasps objects from the side by approaching with a vacuum gripper tilted at an angle. This primitive is robust to thin and flat objects resting against walls, which may not have suctionable surfaces from the top.

Grasp down grasps objects vertically using the two-finger parallel-jaw gripper. This primitive is complementary to the suction primitives in that it is able to pick up objects with smaller, irregular surfaces (e.g. small tools, deformable objects), or made of semi-porous materials that prevent a good suction seal (e.g. cloth).

Flush grasp retrieves unsuctionable objects that are flush against a wall. The primitive is similar to grasp down, but with the additional behavior of using a flexible spatula to slide one finger in between the target object and the wall.

IV-B Learning Affordances with Fully Convolutional Networks

Given the set of pre-defined grasping primitives and RGB-D images of the scene, we train FCNs [27] to infer the affordances for each primitive across a dense pixel-wise sampling of end-effector orientations and locations (i.e. each pixel correlates to a different position on which to execute the primitive). Our approach relies on the assumption that graspable regions can be deduced from the local geometry and material properties, as reflected in visual information. This is inspired by recent data-driven methods for grasp planning [11, 12, 13, 15, 16, 17, 18, 19], which do not rely on object identities or state estimation.

Inferring Suction Affordances. We define suction points as 3D positions where the vacuum gripper’s suction cup should come in contact with the object’s surface in order to successfully grasp it. Good suction points should be located on suctionable (e.g. nonporous) surfaces, and near the target object’s center of mass to avoid an unstable suction seal (particularly for heavy objects). Each suction proposal is defined as a suction point, its local surface normal (computed from the projected 3D point cloud), and its affordance value. Each pixel of an RGB-D image (with a valid depth value) maps surjectively to a suction point.

We train a fully convolutional residual network (ResNet-101 [28]) that takes an RGB-D image as input, and outputs a densely labeled pixel-wise map (with the same image size and resolution as the input) of affordance values between 0 and 1. Values closer to one imply a more preferable suction location. Visualizations of these densely labeled affordance maps are shown as heat maps in the first row of Fig. 5. Our network architecture is multi-modal, where the color data (RGB) is fed into one ResNet-101 tower, and 3-channel depth (DDD, cloned across channels, normalized by subtracting mean and dividing by standard deviation) is fed into another ResNet-101 tower. Features from the ends of both towers are concatenated across channels, followed by 3 additional spatial convolution layers to merge the features; then spatially bilinearly upsampled and softmaxed to output a binary probability map representing the inferred affordances.

Our FCN is trained over a manually annotated dataset of RGB-D images of cluttered scenes with diverse objects, where pixels are densely labeled either positive, negative, or neither. Pixel regions labeled as neither are trained with 0 loss backpropagation. We train our FCNs by stochastic gradient descent with momentum, using fixed learning rates of

and momentum of 0.99. Our models are trained in Torch/Lua with an NVIDIA Titan X on an Intel Core i7-3770K clocked at 3.5 GHz.

During testing, we feed each captured RGB-D image through our trained network to generate dense suction affordances for each view of the scene. As a post-processing step, we use calibrated camera intrinsics and poses to project the RGB-D data and aggregate the affordances onto a combined 3D point cloud. We then compute surface normals for each 3D point (using a local region around it), which are used to classify which suction primitive (down or side) to use for the point. To handle objects without depth, we use a simple hole filling algorithm [29] on the depth images, and project inferred affordance values onto the hallucinated depth. We filter out suction points from the background by performing background subtraction [4] between the captured RGB-D image of the scene with objects and an RGB-D image of the scene without objects (captured automatically before any objects are placed into the picking work-cells).
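The normal-based choice between suction down and suction side can be illustrated with a short sketch. The text does not state the decision rule, so the 45-degree tilt threshold, the convention that +z points up, and the function name `classify_suction` are assumptions for illustration only.

```python
import math
from typing import Sequence


def classify_suction(normal: Sequence[float], max_tilt_deg: float = 45.0) -> str:
    """Label a suction proposal 'down' or 'side' from its surface normal.

    Assumes a gravity-aligned frame where the z axis is vertical.
    max_tilt_deg is a hypothetical threshold, not a value from the paper.
    """
    nx, ny, nz = normal
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    if length == 0.0:
        raise ValueError("zero-length normal")
    # Angle between the normal and the vertical axis; abs() makes the
    # result independent of whether the normal points in or out.
    cos_tilt = min(1.0, abs(nz) / length)
    tilt_deg = math.degrees(math.acos(cos_tilt))
    return "down" if tilt_deg <= max_tilt_deg else "side"
```

For example, a normal of `(0, 0, 1)` (a flat, upward-facing surface) is labeled `down`, while `(1, 0, 0)` (a vertical surface) is labeled `side`.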

Fig. 6: Recognition framework for novel objects. We train a two-stream convolutional neural network where one stream computes 2048-dimensional feature vectors for product images while the other stream computes 2048-dimensional feature vectors for observed images, and optimize both streams so that features are more similar for images of the same object and dissimilar otherwise. During testing, product images of both known and novel objects are mapped onto a common feature space. We recognize observed images by mapping them to the same feature space and finding the nearest neighbor match.

Inferring Grasp Affordances. Grasp proposals are represented by 1) a 3D position which defines the middle point between the two fingers during top-down parallel-jaw grasping, 2) an angle which defines the orientation of the gripper around the vertical axis along the direction of gravity, 3) the width between the gripper fingers during the grasp, and 4) its affordance value.

Two RGB-D views of the scene are aggregated into a registered 3D point cloud, which is then orthographically back-projected upwards in the gravity direction to obtain a “heightmap” image representation of the scene with both color (RGB) and height-from-bottom (D) channels. Each pixel of the heightmap represents a mm vertical column of 3D space in the scene. Each pixel also correlates bijectively to a grasp proposal whose 3D position is naturally computed from the spatial 2D position of the pixel relative to the heightmap image and the height value at that pixel. The gripper orientation of the grasp proposal is horizontal with respect to the frame of the heightmap.
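The heightmap construction and the pixel-to-grasp-position mapping can be sketched as below. Only the height channel is shown. The grid origin, grid size, and cell size are hypothetical parameters (the cell size is passed in rather than fixed), and the function names are chosen for this example.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]  # (x, y, z), z = height above the bin bottom


def build_heightmap(points: Iterable[Point], origin: Tuple[float, float],
                    size: Tuple[int, int], cell: float) -> List[List[float]]:
    """Orthographically project points into a grid, keeping the top surface."""
    rows, cols = size
    heightmap = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        c = int((x - origin[0]) // cell)
        r = int((y - origin[1]) // cell)
        if 0 <= r < rows and 0 <= c < cols:
            # Each pixel is a vertical column; store its highest point.
            heightmap[r][c] = max(heightmap[r][c], z)
    return heightmap


def pixel_to_position(heightmap: List[List[float]], r: int, c: int,
                      origin: Tuple[float, float], cell: float) -> Point:
    """3D position of the grasp proposal at pixel (r, c): cell center plus height."""
    x = origin[0] + (c + 0.5) * cell
    y = origin[1] + (r + 0.5) * cell
    return (x, y, heightmap[r][c])


# Toy cloud in meters; the last point falls outside the 2x2 grid and is ignored.
cloud = [(0.015, 0.005, 0.03), (0.016, 0.004, 0.05), (0.5, 0.5, 0.1)]
hm = build_heightmap(cloud, origin=(0.0, 0.0), size=(2, 2), cell=0.01)
pos = pixel_to_position(hm, 0, 1, origin=(0.0, 0.0), cell=0.01)
```

Because every pixel yields exactly one 3D position, the pixel-to-proposal mapping is one-to-one, which matches the bijective correspondence described above.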

Analogous to our deep network inferring suction affordances, we feed this RGB-D heightmap as input to a fully convolutional ResNet-101 [28], which densely infers affordance values (between 0 and 1) for each pixel – thereby for all top-down parallel-jaw grasping primitives executed with a horizontally oriented gripper across all 3D locations in the heightmap of the scene sampled at pixel resolution. Visualizations of these densely labeled affordance maps are also shown as heat maps in the second row of Fig. 5. By rotating the heightmap of the scene by n different angles prior to feeding it as input to the FCN, we can account for n different gripper orientations around the vertical axis. For our system n = 16 (Fig. 5); hence we compute affordances for all top-down parallel-jaw grasping primitives with 16 forward passes of our FCN to generate 16 output affordance maps.
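The rotation scheme can be sketched as follows. Fig. 5 states that 16 rotations are used; the assumption here is that they are spread evenly over 180 degrees (a parallel-jaw grasp is symmetric under a half turn), giving 11.25-degree steps. The rotation sign convention and the function names are likewise assumptions.

```python
import math
from typing import List, Tuple


def grasp_angles(n: int = 16, span_deg: float = 180.0) -> List[float]:
    """Evenly spaced heightmap rotation angles in degrees (assumed spacing)."""
    return [i * span_deg / n for i in range(n)]


def unrotate_pixel(r: float, c: float, angle_deg: float,
                   rows: int, cols: int) -> Tuple[float, float]:
    """Map a pixel of a heightmap rotated by angle_deg about the image center
    back to (row, col) coordinates in the unrotated heightmap."""
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    t = math.radians(-angle_deg)  # undo the rotation
    dy, dx = r - cy, c - cx
    x = cx + dx * math.cos(t) - dy * math.sin(t)
    y = cy + dx * math.sin(t) + dy * math.cos(t)
    return (y, x)


angles = grasp_angles()
```

Each angle corresponds to one FCN forward pass on a rotated copy of the heightmap; a high-affordance pixel in the rotated map is mapped back with `unrotate_pixel` to recover the grasp location, with the rotation angle giving the gripper orientation.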

We train our FCN over a manually annotated dataset of RGB-D heightmaps, where each positive and negative grasp label is represented by a pixel on the heightmap as well as an angle correlated to the orientation of gripper. We trained this FCN with the same optimization parameters as that of the FCN used for inferring suction affordances.

During post-processing, the width between the gripper fingers for each grasp proposal is determined by using the local geometry of the 3D point cloud. We also use the location of each proposal relative to the bin to classify which grasping primitive (down or flush) should be used: flush grasp is executed for pixels located near the sides of the bins; grasp down is executed for all other pixels. To handle objects without depth, we triangulate no-depth regions in the heightmap using both RGB-D camera views of the scene, and fill in these regions with synthetic height values of 3cm prior to feeding into the FCN. We filter out inferred grasp proposals in the background by using background subtraction with the RGB-D heightmap of an empty work-cell.
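The location-based choice between grasp down and flush grasp can be illustrated with a small sketch. It assumes the heightmap borders coincide with the bin walls; the pixel margin that counts as "near the sides" is a hypothetical parameter, not a value from the paper.

```python
def classify_grasp(r: int, c: int, rows: int, cols: int,
                   margin_px: int = 10) -> str:
    """Return 'flush' for pixels near any bin wall, 'down' otherwise.

    margin_px is an assumed threshold; the heightmap edges are assumed
    to align with the walls of the bin.
    """
    if not (0 <= r < rows and 0 <= c < cols):
        raise ValueError("pixel outside heightmap")
    distance_to_wall = min(r, c, rows - 1 - r, cols - 1 - c)
    return "flush" if distance_to_wall < margin_px else "down"
```

With a 200 by 300 heightmap and the default margin, `classify_grasp(100, 150, 200, 300)` gives `down`, while `classify_grasp(100, 295, 200, 300)` gives `flush`.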

V Recognizing Novel Objects

After successfully grasping an object and isolating it from clutter, the goal of the second step in our system is to recognize the identity of the grasped object.

Since we encounter both known and novel objects, and we have only product images for the novel objects, we address this recognition problem by retrieving the best match among a set of product images. Of course, observed images and product images can be captured in significantly different environments in terms of lighting, object pose, background color, post-process editing, etc. Therefore, we require an algorithm that is able to find the semantic correspondences between images from these two different domains. While this is a task that appears repeatedly in a variety of research topics (e.g. domain adaptation, one-shot learning, meta-learning, visual search, etc.), in this paper we simply refer to it as a cross-domain image matching problem [30, 31, 32].

V-A Metric Learning for Cross-Domain Image Matching

To perform the cross-domain image matching between observed images and product images, we learn a metric function that takes in an observed image and a candidate product image and outputs a distance value that models how likely the images are of the same object. The goal of the metric function is to map both the observed image and product image onto a meaningful feature embedding space so that smaller feature distances indicate higher similarities. The product image with the smallest metric distance to the observed image is the final matching result.
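The final matching step (return the product image with the smallest distance to the observed image) can be sketched as a nearest-neighbor search. The feature vectors below are toy 2-dimensional stand-ins for network outputs, the use of Euclidean distance is an assumption, and `match_product` is a name chosen for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def match_product(observed: Sequence[float],
                  product_features: Dict[str, List[Sequence[float]]]
                  ) -> Tuple[str, float]:
    """Return (label, distance) of the product image closest to the observed image.

    product_features maps an object label to the feature vectors of its
    product images (an object may have several).
    """
    best_label, best_dist = None, math.inf
    for label, features in product_features.items():
        for feature in features:
            dist = math.dist(observed, feature)  # Euclidean distance (assumed)
            if dist < best_dist:
                best_label, best_dist = label, dist
    if best_label is None:
        raise ValueError("no product features provided")
    return best_label, best_dist


products = {"towel": [[0.0, 1.0], [0.1, 0.9]], "mug": [[1.0, 0.0]]}
label, dist = match_product([0.2, 0.8], products)
```

Adding a novel object at test time only requires adding its product-image features to the dictionary; no re-training step appears in the lookup.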

We model this metric function with a two-stream convolutional neural network (ConvNet) architecture where one stream computes features for the observed images, and a different stream computes features for the product images. We train the network by feeding it a balanced 1:1 ratio of matching and non-matching image pairs (one observed image and one product image) from the set of known objects, and backpropagate gradients from the distance ratio loss (Triplet loss [33]). This effectively optimizes the network in a way that minimizes the distances between features of matching pairs while pulling apart the distances between features of non-matching pairs. By training over enough examples of these image pairs across known objects, the network learns a feature embedding that encapsulates object shape, color, and other visual discriminative properties, which can generalize and be used to match observed images of novel objects to their respective product images (Fig. 6).
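To make the training objective concrete, the sketch below implements one published form of a distance-ratio triplet loss, following the triplet network formulation of Hoffer and Ailon: a softmax over the positive and negative distances, penalized toward (0, 1). Whether the system uses exactly this form is an assumption; the text only names a distance ratio loss.

```python
import math


def distance_ratio_loss(d_pos: float, d_neg: float) -> float:
    """Distance-ratio triplet loss for one (anchor, positive, negative) triplet.

    d_pos: feature distance between the matching pair.
    d_neg: feature distance between the non-matching pair.
    The loss is near 0 when d_pos << d_neg and near 2 when d_pos >> d_neg.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(d_pos, d_neg)
    e_pos, e_neg = math.exp(d_pos - m), math.exp(d_neg - m)
    p_pos = e_pos / (e_pos + e_neg)
    p_neg = e_neg / (e_pos + e_neg)
    # Squared error against the target (0, 1).
    return p_pos ** 2 + (p_neg - 1.0) ** 2
```

Minimizing this quantity shrinks the matching distance relative to the non-matching one, which is the pull-together and push-apart behavior described above.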

Avoiding metric collapse by guided feature embeddings.

One issue commonly encountered in metric learning occurs when the number of training object categories is small – the network can easily overfit its feature space to capture only the small set of training categories, making generalization to novel object categories difficult. We refer to this problem as metric collapse. To avoid this issue, we use a model pre-trained on ImageNet [34] for the product image stream and train only the stream that computes features for observed images. ImageNet contains a large collection of images from many categories, and models pre-trained on it have been shown to produce relatively comprehensive and homogeneous feature embeddings for transfer tasks [35] – i.e. providing discriminating features for images of a wide range of objects. Our training procedure trains the observed image stream to produce features similar to the ImageNet features of product images – i.e., it learns a mapping from observed images to ImageNet features. Those features are then suitable for direct comparison to features of product images, even for novel objects not encountered during training.

Using multiple product images. For many applications, there can be multiple product images per object. However, with multiple product images, supervision of the two-stream network can become confusing - on which pair of matching observed and product images should the backpropagated gradients be based? To solve this problem, we add a module we call a “multi-anchor switch” in the network. During training, this module automatically chooses which “anchor” product image to compare against based on nearest neighbor distance. We find that allowing the network to select its own criterion for choosing “anchor” product images provides a significant boost in performance in comparison to alternative methods like random sampling.
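The selection performed by the multi-anchor switch can be sketched as a nearest-neighbor choice among the product images of the same object. This shows only the selection step, outside of any network; the Euclidean distance and the name `pick_anchor` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def pick_anchor(observed: Sequence[float],
                anchors: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the product-image feature nearest to the
    observed-image feature. All anchors belong to the same object."""
    if not anchors:
        raise ValueError("at least one product image is required")
    distances = [math.dist(observed, a) for a in anchors]
    index = min(range(len(anchors)), key=distances.__getitem__)
    return index, distances[index]


# Toy features for three product images of one object (e.g. different views).
views = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
index, distance = pick_anchor([0.9, 1.2], views)
```

During training, the loss for a matching pair would then be computed against the selected anchor only, so gradients are not driven by product views that look unlike the observed image.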

V-B Two Stage Framework for a Mixture of Known and Novel Objects

In settings where both types of objects are present, we find that training two different network models to handle known and novel objects separately can yield higher overall matching accuracies. One is trained to be good at “over-fitting” to the known objects (K-net) and the other is trained to be better at “generalizing” to novel objects (N-net).

Yet, how do we know which network to use for a given image? To address this issue, we execute our recognition pipeline in two stages: a “recollection” stage that determines whether the observed object is known or novel, and a “hypothesis” stage that uses the appropriate network model based on the first stage’s output to perform image matching.

First, the recollection stage infers whether the input observed image from test time is that of a known object that has appeared during training. Intuitively, an observed image is of a novel object if and only if its deep features cannot match to those of any images of known objects. We explicitly model this conditional by thresholding on the nearest neighbor distance to product image features of known objects. In other words, if the distance between the K-net features of an observed image and the nearest neighbor product image of a known object is greater than some threshold k, then the observed image is of a novel object.

In the hypothesis stage, we perform object recognition based on one of two network models: K-net for known objects and N-net for novel objects. The K-net and N-net share the same network architecture. However, the K-net has an additional auxiliary classification loss during training for the known objects. This classification loss increases the accuracy of known objects at test time to near perfect performance, and also boosts up the accuracy of the recollection stage, but fails to maintain the accuracy of novel objects. On the other hand, without the restriction of the classification loss, N-net has a lower accuracy for known objects, but maintains a better accuracy for novel objects.
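The two stages can be put together in a short sketch. Features are toy vectors standing in for K-net and N-net outputs. The Euclidean distance, the choice to search only novel product images in the N-net branch, and all names are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence, Tuple

Gallery = Dict[str, Sequence[float]]  # object label -> product-image feature


def nearest(query: Sequence[float], gallery: Gallery) -> Tuple[float, str]:
    """Return (distance, label) of the closest product feature."""
    if not gallery:
        raise ValueError("empty gallery")
    return min((math.dist(query, feat), label) for label, feat in gallery.items())


def recognize(k_feat: Sequence[float], n_feat: Sequence[float],
              known_k: Gallery, novel_n: Gallery,
              threshold: float) -> Tuple[str, str]:
    """Recollection stage, then hypothesis stage.

    k_feat / n_feat: K-net and N-net features of the same observed image.
    known_k: K-net features of known objects' product images.
    novel_n: N-net features of novel objects' product images.
    """
    dist, label = nearest(k_feat, known_k)
    if dist > threshold:
        # Recollection says "novel": match with N-net features instead.
        return "novel", nearest(n_feat, novel_n)[1]
    return "known", label


known_k = {"book": [0.0, 0.0]}
novel_n = {"towel": [1.0, 1.0], "brush": [5.0, 5.0]}
seen = recognize([0.1, 0.0], [9.0, 9.0], known_k, novel_n, threshold=0.5)
unseen = recognize([3.0, 3.0], [1.2, 0.9], known_k, novel_n, threshold=0.5)
```

The threshold trades off the two error types: a low value sends more known objects to the N-net branch, while a high value forces novel objects onto known labels.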

By adding the recollection stage, we can exploit both the high accuracy of known objects with K-net and good accuracy of novel objects with N-net, though incurring a cost in accuracy from erroneous known vs novel classification. We find that this two stage system overall provides higher total matching accuracy for recognizing both known and novel objects (mixed) than all other baselines (Table III).

VI Experiments

In this section, we evaluate our affordance-based grasping framework, our recognition algorithm over both known and novel objects, as well as our full system in the context of the Amazon Robotics Challenge 2017.

VI-A Evaluating Multi-affordance Grasping

Datasets. To generate datasets for learning affordance-based grasping, we designed a simple labeling interface that prompts users to manually annotate suction and grasp proposals over RGB-D images collected from the real system. For suction, users who have had experience working with our suction gripper are asked to annotate pixels of suctionable and non-suctionable areas on raw RGB-D images overlooking cluttered bins full of various objects. Similarly, users with experience using our parallel-jaw gripper are asked to sparsely annotate positive and negative grasps over re-projected heightmaps of cluttered bins, where each grasp is represented by a pixel on the heightmap and an angle corresponding to the orientation (parallel-jaw motion) of the gripper. On the interface, users directly paint labels on the images with wide-area circular (suction) or rectangular (grasping) brushstrokes. The diameter and angle of the strokes can be adjusted with hotkeys. Strokes are green for positive labels and red for negative labels. Examples of images and labels from this dataset can be found in Fig. 7 of the Appendix. During training, we further augment each grasp label by adding additional labels via small jittering (less than 1.6cm). In total, the dataset contains 1837 RGB-D images with suction and grasp labels. We use a 4:1 training/testing split across this dataset to train and evaluate different models.


In the context of our grasping framework, a method is robust if it is able to consistently find at least one suction or grasp proposal that works. To reflect this, our evaluation metric is the precision of inferred proposals versus manual annotations. For suction, a proposal is considered a true positive if its pixel center is manually labeled as a suctionable area (false positive if manually labeled as a non-suctionable area). For grasping, a proposal is considered a true positive if its pixel center is within 4 pixels and 11.25 degrees of a positive grasp label (false positive if it is similarly close to a negative grasp label).
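A minimal sketch of the grasp-matching criterion follows. The 4-pixel and 11.25-degree tolerances come from the text; everything else is an assumption for illustration: the inclusive comparison, the 180-degree wrap-around for parallel-jaw angles, the choice to check positive labels before negative ones, and the function names.

```python
import math

def angle_diff_deg(a, b, period=180.0):
    """Smallest absolute difference between two angles, wrapping at `period`.

    A 180-degree period assumes a parallel-jaw grasp is symmetric under a
    half turn (an assumption of this sketch).
    """
    d = abs(a - b) % period
    return min(d, period - d)

def matches(proposal, label, max_px=4.0, max_deg=11.25):
    """True if a (x_px, y_px, angle_deg) proposal lies close to a label."""
    (px, py, pa), (lx, ly, la) = proposal, label
    close = math.hypot(px - lx, py - ly) <= max_px
    return close and angle_diff_deg(pa, la) <= max_deg

def precision(proposals, positives, negatives):
    """Precision over proposals that fall near any manual label."""
    tp = fp = 0
    for p in proposals:
        if any(matches(p, l) for l in positives):
            tp += 1          # near a positive label: true positive
        elif any(matches(p, l) for l in negatives):
            fp += 1          # near a negative label: false positive
        # proposals near no label are left out of the count
    return tp / (tp + fp) if tp + fp else None
```

Because annotations are sparse, proposals far from every label are neither counted as correct nor as incorrect in this sketch.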

We report the precision of our inferred proposals for different confidence percentiles in Table I. The precision of the top-1 proposal is reliably above 90% for both suction and grasping. We further compare our methods to heuristic-based baseline algorithms that compute suction affordances by estimating surface normal variance over the observed 3D point cloud (lower variance = higher affordance), and compute anti-podal grasps by detecting hill-like geometric structures in the 3D point cloud. Baseline details and code are available on our project webpage [1].


Primitive   Method     Top-1   Top 1%   Top 5%   Top 10%
Suction     Baseline   35.2    55.4     46.7     38.5
Suction     ConvNet    92.4    83.4     66.0     52.0
Grasping    Baseline   92.5    90.7     87.2     73.8
Grasping    ConvNet    96.7    91.9     87.6     84.1

% precision of grasp proposals across different confidence percentiles.

TABLE I: Multi-affordance Grasping Performance
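One way to compute precision at the confidence levels used in Table I is sketched below. It is an interpretation, not the evaluation script: "top 1%" is read here as the most confident 1% of labeled proposals (rounded up, at least one), and the function name and input format are hypothetical.

```python
import math

def precision_at_top(scored, fraction=None):
    """Precision among the most confident labeled proposals.

    scored: list of (confidence, is_true_positive) pairs.
    fraction: None for top-1, otherwise e.g. 0.01 for the top 1%.
    """
    if not scored:
        raise ValueError("no labeled proposals to evaluate")
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    k = 1 if fraction is None else max(1, math.ceil(fraction * len(ranked)))
    top = ranked[:k]
    return sum(1 for _, ok in top if ok) / len(top)
```

Under this reading, precision typically falls as the fraction grows because less confident proposals are admitted, which is the trend seen in most rows of the table.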

Speed. Our suction and grasp affordance algorithms were designed to achieve fast run-time speeds during test time by densely inferring affordances over images of the entire scene. In Table II, we compare our run-time speeds to several state-of-the-art alternatives for grasp planning. Our own numbers measure the time of each FCN forward pass, reported with an NVIDIA Titan X on an Intel Core i7-3770K clocked at 3.5 GHz, excluding time for image capture and other system-related overhead. Our FCNs run at a fraction of the time required by most other methods, while also being significantly deeper (with 101 layers) than all other deep learning methods.

VI-B Evaluating Novel Object Recognition

We evaluate our recognition algorithms using a 1 vs 20 classification benchmark. Each test sample in the benchmark contains 20 possible object classes, where 10 are known and 10 are novel, chosen at random. During each test sample, we feed the recognition algorithm the product images for all 20 objects as well as an observed image of a grasped object. In Table III, we measure performance in terms of average % accuracy of the top-1 nearest neighbor product image match of the grasped object. We evaluate our method against a baseline algorithm, a state-of-the-art network architecture for both visual search [32] and one-shot learning without retraining [36], and several variations of our method. The latter provides an ablation study to show the improvements in performance with every added component:

Method                   Time (sec.)
Lenz et al. [12]         13.5
Zeng et al. [4]          10 - 15
Hernandez et al. [3]     5 - 40 (a)
Schwarz et al. [5]       0.9 - 3.3
Dex-Net 2.0 [17]         0.8
Matsumoto et al. [20]    0.2
Redmon et al. [13]       0.07
Ours (suction)           0.06
Ours (grasping)          0.05 × n (b)

(a) times reported from [20] derived from [3].
(b) n = number of possible grasp angles.

TABLE II: Grasp Planning Run-Times (sec.)

Nearest neighbor is a baseline algorithm where we compute features of product images and observed images using a ResNet-50 pre-trained on ImageNet, and match them by nearest neighbor on the distance between these features.

Siamese network with weight sharing is a re-implementation of Bell et al. [32] for visual search and Koch et al. [36] for one shot recognition without retraining. We use a Siamese ResNet-50 pre-trained on ImageNet and optimized over training pairs in a Siamese fashion. The main difference between this method and ours is that the weights between the networks computing deep features for product images and observed images are shared.

Two-stream network without weight sharing is a two-stream network, where the networks’ weights for product images and observed images are not shared. Without weight sharing, the network has more flexibility to learn the mapping function and thus achieves higher matching accuracy. All models described later in this section use this two-stream network without weight sharing.

Two-stream + guided-embedding (GE) includes a guided feature embedding with ImageNet features for the product image stream. We find this model has better performance for novel objects than for known objects.

Two-stream + guided-embedding (GE) + multi-product-images (MP) By adding a multi-anchor switch, we see more improvements to accuracy for novel objects. This is the final network architecture for N-net.

Two-stream + guided-embedding (GE) + multi-product-images (MP) + auxiliary classification (AC) By adding an auxiliary classification, we achieve near-perfect accuracy on known objects for later models, but at the cost of lower accuracy on novel objects. This also improves known vs novel (K vs N) classification accuracy for the recollection stage. This is the final network architecture for K-net.

Two-stage system As described in Section V, we combine the two different models - one that is good at known objects (K-net) and the other that is good at novel objects (N-net) - in the two stage system. This is our final recognition algorithm, and it achieves better performance than any single model for test cases with a mixture of known and novel objects.

Method K vs N Known Novel Mixed
Nearest Neighbor 69.2 27.2 52.6 35.0
Siamese ([32, 36]) 70.3 76.9 68.2 74.2
Two-stream 70.8 85.3 75.1 82.2
Two-stream + GE 69.2 64.3 79.8 69.0
Two-stream + GE + MP (N-net) 69.2 56.8 82.1 64.6
N-net + AC (K-net) 93.2 99.7 29.5 78.1
Two-stage K-net + N-net 93.2 93.6 77.5 88.6
TABLE III: Recognition Evaluation (% Accuracy of Top-1 Match)

VI-C Full System Evaluation in Amazon Robotics Challenge

To evaluate the performance of our system as a whole, we used it as part of our MIT-Princeton entry for the 2017 Amazon Robotics Challenge (ARC), where state-of-the-art pick-and-place solutions competed in the context of a warehouse automation task. Participants were tasked with designing a robot system to grasp and recognize a large variety of different objects in unstructured storage systems. The objects were characterized by a number of difficult-to-handle properties. Unlike earlier versions of the competition [37], half of the objects were novel in the 2017 edition of the competition. The physical objects as well as related item data (i.e. product images, weight, 3D scans), were given to teams just 30 minutes before the competition. While other teams used the 30 minutes to collect training data for the new objects and re-train models, our unique system did not require any of that during those 30 minutes.

Setup. Our system setup for the competition features several differences. We incorporated weight sensors into our system, using them as a guard to signal a stop for grasping primitive behaviors during execution. We also used the measured weights of objects provided by Amazon to boost recognition accuracy to near perfect performance. Green screens made the background more uniform to further boost accuracy of the system in the recognition phase. For inferring affordances, Table I shows that our data-driven methods with ConvNets provide more precise affordances for both suction and grasping than the baseline algorithms. For the case of parallel-jaw grasping, however, we did not have time to develop a fully stable network architecture before the day of the competition, so we decided to avoid risks and use the baseline grasping algorithm. The ConvNet-based approach became stable with the reduction to inferring only horizontal grasps and rotating the input heightmaps. This is discussed in more depth in the Appendix, along with a state tracking/estimation algorithm used for the picking task of the ARC.

Results. During the ARC 2017 final stowing task, we had a 58.3% pick success rate with suction, a 75% pick success rate with grasping, and 100% recognition accuracy, stowing all 20 objects within 24 suction attempts and 8 grasp attempts. Our system took 1st place in the stowing task, being the only system to have successfully stowed all known and novel objects and to have finished the task well within the allotted time frame.
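The reported rates and attempt counts can be cross-checked with a few lines of arithmetic. The per-primitive success counts and the overall rate below are derived from the stated percentages, not reported in the text.

```python
suction_attempts, grasp_attempts = 24, 8
suction_rate, grasp_rate = 0.583, 0.75

# Recover whole-number success counts from the reported rates.
suction_successes = round(suction_rate * suction_attempts)   # 13.992 rounds to 14
grasp_successes = round(grasp_rate * grasp_attempts)         # 6.0 rounds to 6

total_successes = suction_successes + grasp_successes
overall_rate = total_successes / (suction_attempts + grasp_attempts)
print(total_successes, f"{overall_rate:.1%}")  # prints: 20 62.5%
```

The 14 suction picks and 6 grasp picks add up to the 20 stowed objects, so the figures are mutually consistent.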

VII Discussion and Future Work

We present a system to pick and recognize novel objects with very limited prior information about them (a handful of product images). The system first uses an object-agnostic visuomotor affordance-based algorithm to select among four different grasping primitive actions, and then recognizes grasped objects by matching them to their product images. We evaluate both components and demonstrate their combination in a robot system that picks and recognizes novel objects in heavy clutter, and that took 1st place in the stowing task of the Amazon Robotics Challenge 2017. Here are some of the most salient features/limitations of the system:

Object-Agnostic Manipulation. The system finds grasp affordances directly in the RGB-D image. This proved faster and more reliable than doing object segmentation and state estimation prior to grasp planning [4]. The ConvNet learns the visual features that make a region of an image graspable or suctionable. It also seems to learn more complex rules, e.g., that tags are often easier to suction than the object itself, or that the center of a long object is preferable to its ends. It would be interesting to explore the limits of the approach, for example by learning affordances for more complex behaviors, such as scooping an object against a wall, which require a more global understanding of the geometry of the environment.

Pick First, Ask Questions Later. The standard grasping pipeline is to first recognize and then plan a grasp. In this paper we demonstrate that it is possible and sometimes beneficial to reverse the order. Our system leverages object-agnostic picking to remove the need for state estimation in clutter. Isolating the picked object drastically increases object recognition reliability, especially for novel objects. We conjecture that “pick first, ask questions later” is a good approach for applications such as bin-picking, emptying a bag of groceries, or clearing debris. It is, however, not suited for all applications, namely those where we need to pick a particular object. In that case, the described system needs to be augmented with state tracking/estimation algorithms.

Towards Scalable Solutions. Our system is designed to pick and recognize novel objects without extra data collection or re-training. This is a step forward towards robotic solutions that scale to the challenges of service robots and warehouse automation, where the daily number of novel objects ranges from the tens to the thousands, making data-collection and re-training cumbersome in one case and impossible in the other. It is interesting to consider what data, besides product images, is available that could be used for recognition using out-of-the-box algorithms like ours.

Limited to Accessible Grasps. The system we present in this work is limited to picking objects that can be directly perceived and grasped by one of the primitive picking motions. Real scenarios, especially when targeting the grasp of a particular object, often require plans that deliberately sequence different primitive motions, for example when removing an object to pick the one below, or when separating two objects before grasping one. This points to a more complex picking policy with a planning horizon that includes preparatory primitive motions like pushing whose value is difficult to reward/label in a supervised fashion. Reinforcement learning of policies that sequence primitive picking motions is a promising alternative approach worth exploring.

Open-loop vs. Closed-loop Grasping. Most existing grasping approaches, whether model-based or data-driven, are for the most part based on open-loop executions of planned grasps. Our system is no different. The robot decides what to do and executes it almost blindly, except for simple feedback to enable guarded moves like move until contact. Indeed, the most common failure modes are when small errors in the estimated affordances lead to fingers landing on top of an object rather than on the sides, or lead to a deficient suction latch, or lead to a grasp that is only marginally stable and likely to fail when the robot lifts the object. It is unlikely that the picking error rate can be trimmed to industrial grade without the use of explicit feedback for closed-loop grasping during the approach-grasp-retrieve operation.


  • [1] Webpage for code and data. [Online]. Available:
  • [2] R. Jonschkowski, C. Eppner, S. Höfer, R. Martín-Martín, and O. Brock, “Probabilistic multi-class segmentation for the amazon picking challenge,” 2016.
  • [3] C. Hernandez, M. Bharatheesha, W. Ko, H. Gaiser, J. Tan, K. van Deurzen, M. de Vries, B. Van Mil, et al., “Team delft’s robot winner of the amazon picking challenge 2016,” arXiv, 2016.
  • [4] A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker Jr, A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge,” in ICRA, 2017.
  • [5] M. Schwarz, A. Milan, C. Lenz, A. Munoz, A. S. Periyasamy, M. Schreiber, S. Schüller, and S. Behnke, “Nimbro picking: Versatile part handling for warehouse automation,” in ICRA, 2017.
  • [6] J. M. Wong, V. Kee, T. Le, S. Wagner, G.-L. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, et al., “Segicp: Integrated deep semantic segmentation and pose estimation,” arXiv, 2017.
  • [7] A. Bicchi and V. Kumar, “Robotic Grasping and Contact,” ICRA.
  • [8] A. Miller, S. Knoop, H. Christensen, and P. K. Allen, “Automatic grasp planning using shape primitives,” ICRA, 2003.
  • [9] M. Nieuwenhuisen, D. Droeschel, D. Holz, J. Stückler, A. Berner, J. Li, R. Klein, and S. Behnke, “Mobile bin picking with an anthropomorphic service robot,” in ICRA, 2013.
  • [10] M.-Y. Liu, O. Tuzel, A. Veeraraghavan, Y. Taguchi, T. K. Marks, and R. Chellappa, “Fast object localization and pose estimation in heavy clutter for robotic bin picking,” IJRR, 2012.
  • [11] A. Morales et al., “Using experience for assessing grasp reliability,” in IJHR, 2004.
  • [12] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” in IJRR, 2015.
  • [13] J. Redmon and A. Angelova, “Real-time grasp detection using convolutional neural networks,” in ICRA, 2015.
  • [14] A. ten Pas and R. Platt, “Using geometry to detect grasp poses in 3d point clouds,” in ISRR, 2015.
  • [15] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in ICRA, 2016.
  • [16] L. Pinto, J. Davidson, and A. Gupta, “Supervision via competition: Robot adversaries for learning tasks,” in ICRA, 2017.
  • [17] J. Mahler et al., “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” in RSS, 2017.
  • [18] M. Gualtieri, A. ten Pas, K. Saenko, and R. Platt, “High precision grasp pose detection in dense clutter,” in arXiv, 2017.
  • [19] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in ISER, 2016.
  • [20] E. Matsumoto, M. Saito, A. Kume, and J. Tan, “End-to-end learning of object grasp poses in the amazon robotics challenge.”
  • [21] R. Bajcsy and M. Campos, “Active and exploratory perception,” CVGIP: Image Understanding, vol. 56, no. 1, 1992.
  • [22] S. Chen, Y. Li, and N. M. Kwok, “Active vision in robotic systems: A survey of recent developments,” IJRR, 2011.
  • [23] D. Jiang, H. Wang, W. Chen, and R. Wu, “A novel occlusion-free active recognition algorithm for objects in clutter,” in ROBIO, 2016.
  • [24] K. Wu, R. Ranasinghe, and G. Dissanayake, “Active recognition and pose estimation of household objects in clutter,” in ICRA, 2015.
  • [25] D. Jayaraman and K. Grauman, “Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion,” in ECCV, 2016.
  • [26] R. Diankov, “Automated construction of robotic manipulation programs,” Ph.D. dissertation, CMU RI, 2010.
  • [27] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [29] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in ECCV, 2012.
  • [30] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models to new domains,” ECCV, 2010.
  • [31] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, “Data-driven visual similarity for cross-domain image matching,” in TOG, 2011.
  • [32] S. Bell and K. Bala, “Learning visual similarity for product design with convolutional neural networks,” TOG, 2015.
  • [33] E. Hoffer, I. Hubara, and N. Ailon, “Deep unsupervised learning through spatial contrasting,” arXiv, 2016.
  • [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [35] M. Huh, P. Agrawal, and A. A. Efros, “What makes imagenet good for transfer learning?” arXiv, 2016.
  • [36] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Workshop, 2015.
  • [37] N. Correll, K. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. Romano, and P. Wurman, “Analysis and Observations from the First Amazon Picking Challenge,” T-ASE, 2016.
  • [38] M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.

VIII Appendix

VIII-A Task Planner Details

Our full system for the ARC also includes a task planner that selects and executes the suction or grasp proposal with the highest affordance value. Prior to this, affordance values are scaled by a factor that is specific to the proposal’s primitive action type: suction down (sd), suction side (ss), grasp down (gd), or flush grasp (fg). The value of this factor is determined by several task-specific heuristics that induce more efficient picking under competition settings. Here we briefly describe these heuristics:

Suction first, grasp later. We empirically find suction to be more reliable than parallel-jaw grasping when picking in scenarios with heavy clutter (10+ objects). Hence, to reflect a greedy picking strategy that initially favors suction over grasping, the scaling factors are set to favor the suction primitives (sd, ss) over the grasping primitives (gd, fg) for the first 3 minutes of either ARC task (stowing or picking).

Avoid repeating unsuccessful attempts. It is possible for the system to get stuck repeatedly executing the same (or similar) suction or grasp proposal as no change is made to the scene (and hence affordance estimates remain the same). Therefore, after each unsuccessful suction or parallel-jaw grasping attempt, the affordances of the proposals (for the same primitive action) within a 2cm radius of the unsuccessful attempt are set to 0.

Encouraging exploration upon repeat failures. The planner re-weights grasping primitive actions depending on how often they fail. For primitives that have been unsuccessful twice in the last 3 minutes, the scaling factor is reduced; if unsuccessful more than three times, it is reduced further. This not only helps the system avoid repeating unsuccessful actions, but also prevents it from excessively relying on any one primitive that doesn’t work as expected (e.g. in the case of an unexpected hardware failure preventing suction air flow).

Leveraging dense affordances for speed picking. Our FCNs densely infer affordances for all visible surfaces in the scene, which enables the robot to attempt multiple different suction or grasping proposals (at least 3cm apart from each other) in quick succession until at least one of them is successful (given by immediate feedback from flow sensors or gripper finger width). This improves picking efficiency.
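The selection and suppression heuristics above can be sketched as follows. This is an illustrative reading rather than the competition code: the proposal format, the function names, and the scaling values used in the usage note are hypothetical; only the 2cm suppression radius and the four primitive types come from the text.

```python
import math

def select_proposal(proposals, scale):
    """Pick the proposal with the highest scaled affordance value.

    proposals: list of dicts with keys "primitive" (one of "sd", "ss",
    "gd", "fg"), "xy" (position in meters) and "affordance".
    scale: dict mapping primitive type -> scaling factor (values assumed).
    """
    return max(proposals, key=lambda p: p["affordance"] * scale[p["primitive"]])

def suppress_near_failure(proposals, failed, radius_m=0.02):
    """Zero same-primitive affordances within radius_m of a failed attempt."""
    fx, fy = failed["xy"]
    for p in proposals:
        px, py = p["xy"]
        same_primitive = p["primitive"] == failed["primitive"]
        nearby = math.hypot(px - fx, py - fy) <= radius_m
        if same_primitive and nearby:
            p["affordance"] = 0.0
```

For example, with illustrative factors of 1.0 for suction and 0.5 for grasping, a failed suction attempt removes its immediate neighbors from consideration, and the next call to `select_proposal` moves on to a different region or a different primitive.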

VIII-B State Tracking/Estimation

While the system described in the main paper works well out-of-the-box for the stowing task of the ARC, it requires an additional state tracking/estimation algorithm in order to perform competitively during the picking task of the ARC, where the goal is to pick target objects out of a storage system (e.g. shelves, separate work-cells) and place them into specific boxes for order fulfillment. Our state tracking algorithm is built around the assumption that each object in the storage system has been placed by another automated system – hence the identities of the objects and their positions in the storage system can be tracked over time as the storage system is stocked (e.g. from being completely empty to being full of objects).

The goal of our state tracking algorithm is to track the objects (their identities, 6D poses, amodal bounding boxes, and support relationships) as they are individually placed into the storage system one after the other. This information can then later be used by the task planner during the picking task to prioritize certain grasp proposals (close to, or above target objects) over others. After executing a grasp, our system continues to perform the recognition algorithm described in the main paper as a final verification step before placing it into a box. Objects that are not the intended target objects for order fulfillment are placed into another (relatively empty) bin in the storage system.

When autonomously adding an object into the storage system (e.g. during the stowing task), our state tracking algorithm captures RGB-D images of the storage system before and after the object is placed. The difference between these two RGB-D images provides an estimate for the visible surfaces of the newly placed object (i.e. near the pixel regions with the largest change). 3D models of the objects (either constructed from the same RGB-D data captured during recognition or given by another system) are aligned to these visible surfaces via ICP-based pose estimation [4]. To reduce the uncertainty and noise of these pose estimates, the placing primitive actions are gently executed – i.e. the robot arm holding the object moves down slowly until contact between the object and storage system is detected with weight sensors, upon which the gripper releases the object.
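The before/after differencing step can be illustrated on small depth heightmaps. This is a simplified sketch: it thresholds the per-pixel change instead of searching for the regions of largest change, it uses depth only, and the threshold value and function names are assumptions. The ICP alignment that follows is not shown.

```python
def changed_pixels(before, after, min_change=0.01):
    """Boolean mask of pixels whose height changed by more than min_change.

    before / after: 2D lists of heights (meters) with identical shape.
    """
    return [[abs(a - b) > min_change for b, a in zip(row_b, row_a)]
            for row_b, row_a in zip(before, after)]

def bounding_box(mask):
    """Return (min_row, min_col, max_row, max_col) of True pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, changed in enumerate(row) if changed]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)
```

The resulting region gives a rough estimate of where the new object's visible surfaces lie, which would then seed the pose estimation.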

To handle placing into boxes with different sizes, our recognition framework simultaneously estimates a 3D bounding box of the grasped object (using the same RGB-D data captured for the recognition framework). The bounding box enables the placing primitives to re-orient grasped objects such that they can fit into the target boxes.

Fig. 7: Images and annotations from the grasping dataset with labels for suction (top row) and parallel-jaw grasping (bottom row). Positive labels appear in green while negative labels appear in red.

VIII-C Other Network Architectures for Parallel-Jaw Grasping

A significant challenge during the development of our system was designing a deep network architecture for inferring dense affordances for parallel-jaw grasping that 1) supports various gripper orientations and 2) could quickly converge during training with fewer than 2000 manually labeled images. It took several iterations of network architecture designs before discovering one that worked (described in the main paper). Here, we briefly review the deprecated architectures and their primary drawbacks:

Parallel trunks and branches (n copies). This design consists of n separate FCNs, each responsible for inferring the output affordances for one of n grasping angles. Each FCN shares the same architecture: a multi-modal trunk (with color (RGB) and depth (DDD) data fed into two ResNet-101 towers pre-trained on ImageNet, where features at the ends of both towers are concatenated across channels), followed by 3 additional spatial convolution layers to merge the features; the result is then spatially bilinearly upsampled and softmaxed to output an affordance map. This design is similar to our final network design, but with two key differences: 1) there are multiple FCNs, one for each grasping angle, and 2) the input data is not rotated prior to feeding as input to the FCNs. This design is sample inefficient, since each network during training is optimized to learn a different set of visual features to support a specific grasping angle, thus requiring a substantial amount of training samples with that specific grasping angle to converge. Our small manually annotated dataset is characterized by an unequal distribution of training samples across different grasping angles, some of which have fewer than 100 training samples. Hence, only a few of the FCNs (those for grasping angles with more than 1,000 training samples) are able to converge during training. Furthermore, pre-loading all n FCNs into GPU memory for test time requires multiple GPUs.

One trunk, split to parallel branches. This design consists of a single FCN architecture, which contains a multi-modal ResNet-101 trunk followed by a split into parallel, individual branches, one for each grasping angle. Each branch contains 3 spatial convolution layers followed by spatial bilinear upsampling and softmax to output affordance maps. While more lightweight in terms of GPU memory consumption (i.e. the trunk is shared and only the 3-layer branches have multiple copies), this FCN still runs into similar training convergence issues as the previous architecture, where each branch during training is optimized to learn a different set of visual features to support a specific grasping angle. The uneven distribution of limited training samples in our dataset made it so that only a few branches are able to converge during training.

One trunk, rotate, one branch. This design consists of a single FCN architecture, which contains a multi-modal ResNet-101 trunk, followed by a spatial transform layer [38] to rotate the intermediate feature map from the trunk with respect to an input grasp angle (such that the gripper orientation is aligned horizontally to the feature map), followed by a branch with 3 spatial convolution layers, spatially bilinearly upsampled, and softmaxed to output a single affordance map for the input grasp angle. This design is even more lightweight than the previous architecture in terms of GPU memory consumption and performs well for grasping angles with a sufficient amount of training samples, but continues to perform poorly for grasping angles with very few training samples (less than 100).

One trunk and branch (rotate n times). This is the final network architecture design as proposed in the main paper, which differs from the previous design in that the rotation occurs directly on the input image representation, once for each of the n grasping angles, prior to feeding through the FCN (rather than in the middle of the architecture). This enables the entire network to share visual features across different grasping orientations, enabling it to generalize to grasping angles for which there are very few training samples.
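The rotate-the-input idea can be sketched in pure Python on a small square heightmap. Nothing here is the actual network: `toy_horizontal_fcn` is a hypothetical stand-in that scores a pixel by how much it rises above its left and right neighbors, the nearest-neighbor rotation replaces whatever resampling the real system uses, and the set of angles is chosen by the caller.

```python
import math

def rotate_nearest(img, deg):
    """Rotate a square 2D list about its center (nearest neighbor, zero fill)."""
    n = len(img)
    c = (n - 1) / 2.0
    cos_t, sin_t = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for col in range(n):
            # Map each output pixel back to its source location.
            x, y = col - c, r - c
            src_col = round(cos_t * x + sin_t * y + c)
            src_row = round(-sin_t * x + cos_t * y + c)
            if 0 <= src_row < n and 0 <= src_col < n:
                out[r][col] = img[src_row][src_col]
    return out

def toy_horizontal_fcn(h):
    """Stand-in for the FCN: score only horizontal grasps at each pixel."""
    n = len(h)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(1, n - 1):
            out[r][c] = max(0.0, h[r][c] - max(h[r][c - 1], h[r][c + 1]))
    return out

def dense_grasp_affordances(heightmap, horizontal_fcn, angles_deg):
    """Run one shared model on rotated inputs, one pass per grasp angle."""
    maps = {}
    for deg in angles_deg:
        rotated = rotate_nearest(heightmap, deg)
        scores = horizontal_fcn(rotated)
        # Rotate the scores back so every map is in the original frame.
        maps[deg] = rotate_nearest(scores, -deg)
    return maps
```

Because the same model sees every orientation as a horizontal grasp, training samples from all angles contribute to one set of features, which is the sharing effect described above.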