GRIP: Generative Robust Inference and Perception for Semantic Robot Manipulation in Adversarial Environments

by   Xiaotong Chen, et al.

Recent advancements have led to a proliferation of machine learning systems used to assist humans in a wide range of tasks. However, we are still far from accurate, reliable, and resource-efficient operation of these systems. For robot perception, convolutional neural networks (CNNs) for object detection and pose estimation have recently come into widespread use. However, neural networks are known to overfit during training and to be less robust under unseen conditions, which makes them especially vulnerable in adversarial scenarios. In this work, we propose Generative Robust Inference and Perception (GRIP) as a two-stage object detection and pose estimation system that aims to combine the relative strengths of discriminative CNNs and generative inference methods to achieve robust estimation. Our results show that a second stage of sample-based generative inference is able to recover from false object detections by CNNs and to produce robust estimates in adversarial conditions. We demonstrate the robustness of GRIP through comparison with state-of-the-art learning-based pose estimators and through pick-and-place manipulation in dark and cluttered environments.





I Introduction

Taking advantage of the renaissance in deep neural networks, machine learning has achieved great progress in object detection and segmentation [1] and image recognition [2]. These deep learning methods are also prevalent in robotics for problems including manipulation in clutter [3] and learning of manipulation actions [4]. For 6D object pose estimation, learning-based Convolutional Neural Networks (CNNs) have achieved promising accuracy and real-time inference speed [5, 6, 7]. Notably, these successes rely on well-designed models and adequate training resources. The robustness and generalization capability of CNNs heavily depend on the training data, which represents a certain range of conditions that could be faced by robots. However, due to the complex and dynamic nature of the real world, robots are subject to unseen environmental conditions, which are not present in the training data.

More specifically, CNN recognition systems are vulnerable to errors (both benign and malicious) due to the effects of overfitting during training. Distorted objects and/or objects captured under poor lighting conditions can be enough to defeat the recognition abilities of a CNN [8]. Such perception errors can lead to potentially disastrous outcomes for embodied systems acting in the real world. Robust perception becomes even more challenging when an adversary can modify the environment to exploit the vulnerabilities of a CNN. For instance, in the context of object recognition for a robotic system, a malicious attack carried out through simple modifications of the environment has the potential to drastically alter and even manipulate a robot’s final behavior. Fig. 1 shows such a robot manipulation task in a dark scene.

Fig. 1: Our GRIP system perceiving and grasping an object in adversarially darkened lighting. GRIP uses two stages: (a) PyramidCNN object detection, showing bounding boxes with confidence score greater than 0.1 (green boxes) along with the ground truth (red box), and (b) sample-based generative inference. The (c) resulting estimate and (d) its localized pose (highlighted in cyan) enable (e) the Michigan Progress Fetch robot to accurately grasp the potted meat can object.

Generative-discriminative algorithms [9, 10] offer a promising avenue for robust perception. Such methods combine inference by deep learning (or other discriminative techniques) with sampling and probabilistic inference models to achieve robust and adaptive perception in adversarial environments. The value proposition for generative-discriminative inference is to get the best out of existing approaches to computational perception and robotic manipulation while avoiding their shortcomings. We want the robustness of belief space planning [11, 12] without its computational intractability, the recall power of neural networks without excessive overfitting [4], and the efficiency of deterministic inference without its fragility to uncertainty [13, 14]. Generative-discriminative algorithms will be especially advantageous when exposed to adversarial attack, building on foundational ideas in this space [15, 16, 17, 18, 19]. Furthermore, we expect our approach to be more generally applicable in guarding against broad categories of attack, with a clear pathway for explainability of the resulting perceptual estimates.

In this paper, we present Generative Robust Inference and Perception (GRIP) as a two-stage method to explore generative-discriminative inference for object recognition and pose estimation in adversarial environments. Within GRIP, we represent the first stage of inference as a CNN-based recognition distribution. The CNN recognition distribution is used within a second stage of generative multi-hypothesis optimization. This optimization is implemented as a particle filter with a static state process. We show that our GRIP method produces comparable and improved performance with respect to state-of-the-art pose estimation systems (PoseCNN [5] and DOPE [6]) under adversarial scenarios with varied lighting and cluttered occlusion. Moreover, we demonstrate the compatibility of GRIP with goal-directed sequential manipulation in object pick-and-place tasks with a Michigan Progress Fetch robot.

II Background

Fig. 2: Overview of GRIP. The robot operating in a dark and cluttered environment is to grasp the meat can from its RGBD observation. Stage 1 takes the RGB image and generates object bounding boxes with confidence scores. Stage 2 takes the depth image and performs sample-based generative inference to estimate the pose for each object in the scene. The samples in Stage 2 are initialized according to bounding boxes from Stage 1. From this estimate, the robot performs manipulation on the meat can object.

II-A Motivation

To get the best of both worlds, we consider the relative strengths and weaknesses of state-of-the-art deep learning and generative inference for robust perception. We are particularly interested in complementary properties of these methods for making perceptual decisions, where the weaknesses of one can be addressed by the strengths of the other. Despite the strengths of CNNs, they have several shortcomings that leave them vulnerable to adversarial action, such as opacity about how their decisions are made, fragility when generalizing beyond overfit training examples, and inflexibility in recovering when false decisions are produced. For these methods, Goodfellow et al. [20] demonstrated that adversarial examples are misclassified across different architectures and across models trained on different subsets of the training data. These weaknesses of CNNs play to the strengths of generative probabilistic inference, which is inherently explainable, general, and resilient through the process of generating, evaluating, and maintaining a distribution of many hypotheses representing possible decisions. However, this robustness comes at the cost of computational efficiency. Probabilistic inference, in contrast to CNNs, is often computationally intractable, with complexity that grows exponentially with the number of variables. GRIP aims to overcome these limitations by combining the strengths of deep learning and probabilistic inference through a two-stage algorithm, illustrated in Fig. 2 and discussed later in Section IV. The remainder of this background section provides a broader overview of related existing works.

II-B Perception for Manipulation

Perception is a critical step for robotic manipulation in unstructured environments. Ciocarlie et al. [21] proposed an architecture for reliable grasping and manipulation, where non-touching, isolated objects are estimated by clustering the surface normals of RGBD sensor data. The MOPED framework [19] has been proposed for object detection and pose estimation using iterative clustering estimation from multi-view features. A bottom-up approach is taken in [22] using RANSAC and Iterative Closest Point registration (ICP), relying solely on geometric information. Narayanan et al. [15] integrated global search with discriminatively trained algorithms to balance robustness and efficiency; their method works on multi-object identification, assuming known objects.

For manipulation in dense cluttered environments, ten Pas and Platt [23] showed success in detecting grasp affordances from 3D point clouds. In [24], they sample grasp pose candidates based on their geometric plausibility, from which feasible grasp poses are selected by a CNN. Regarding manipulation with known object geometry models, [25, 26, 27] proposed generative sampling approaches to scene estimation for object poses and physical support relations. However, these methods used object detection bounding boxes with hard thresholding as the prior for generative sampling, which might cause false negatives.

II-C Object Detection and Pose Estimation

Learning-based approaches have been used as modules in object pose estimation systems or built directly as end-to-end approaches. Sui et al. [9] proposed a sample-based two-stage framework for sequential manipulation tasks, where object detection results are used as the prior for sample initialization. Mitash et al. [28] developed a two-stage approach, which ran stochastic sampling of congruent sets [29] to get object poses based on the semantic map from a segmentation network. Regarding end-to-end systems, PoseCNN [5] was proposed by constructing a neural network that learned segmentation, object 3D translation, and 3D rotation separately. This work also contributed a large object dataset, called YCB-Video-Dataset, for benchmarking robotics pose estimation and manipulation approaches. DOPE [6] outperformed PoseCNN in estimation accuracy and robustness in dark and occluded scenes by training the network on a synthetic dataset generated with domain randomization and photo-realistic simulation. Another recent work, DenseFusion [7], utilized two networks to extract RGB and depth features separately. 6D poses are learned from the combined features and then refined by another residual network module.

In this paper, we focus on the pose estimation problem in adversarial scenarios. Liu et al. [10] provided insight into handling adversarial clutter, yet provided limited evaluations of their approach or comparisons with state-of-the-art methods. We believe that the performance of CNNs relies highly on the consistency of the testing environment with the training set, and that the same is true for the two-stage methods in [9] and [28], since they rely on high-quality CNN output from their first stages. Our main contribution is the development of a two-stage pose estimation system that is robust under adversarial scenarios and able to recover from false detections from its own first stage.

III Problem Formulation

Given an RGB-D observation ($z_r$, $z_d$) from the robot sensor, consisting of an RGB image $z_r$ and a depth image $z_d$, and 3D geometry models of a known object set, our aim is to estimate the conditional joint distribution $P(q, b \mid o, z_r, z_d)$ for each object class $o$, where $q$ is the six DoF object pose and $b$ is the object bounding box in the RGB image. The problem can be formulated as:

$$P(q, b \mid o, z_r, z_d) \tag{1}$$
$$= P(q \mid b, o, z_r, z_d)\, P(b \mid o, z_r, z_d) \tag{2}$$
$$= P(q \mid b, o, z_d)\, P(b \mid o, z_r) \tag{3}$$

Equations (1) and (2) are derived using the chain rule of probability, and Equation (3) represents the factoring of object detection and pose estimation. Here, we assume that pose estimation is conditionally independent of the RGB observation, while object detection is conditionally independent of the depth observation.

Ideally, we could use Markov Chain Monte Carlo (MCMC) [30] to estimate the distribution of Equation (1). However, the state space is so large that direct computation is intractable. End-to-end neural network methods can also be used to calculate the distribution [5, 6, 7]. These methods rely heavily on proper coverage of the input space in the training set. This data reliance makes such methods vulnerable to unforeseen environment changes. SUM [9] implements a combination of Equation (1) to filter over hard detections provided by a CNN, thereby enabling it to filter out false positive CNN detections. The limitation of SUM is its inability to recover from false negatives that are eliminated from consideration in the object proposal and detection stages. In contrast, our GRIP paradigm is able to compensate for data deficiency by employing a generative sampling method in the second stage.

IV Method

We propose a two-stage paradigm to combine object detection and pose estimation, as shown in Fig. 2. In the first stage of inference, PyramidCNN performs object detection and generates a prior distribution of 2D bounding boxes for each object label. In the second stage, we perform generative multi-hypothesis optimization to estimate the joint distribution for each object label, using the first-stage output as the prior. The second stage is implemented as an iterated likelihood weighting filter [31], given by Equations (4) to (6), which include a normalizing factor. In Equation (4), the initial poses are generated from bounding boxes, which are sampled from the prior distribution generated by the first stage. After the second stage, we obtain a probability distribution over pose estimates as shown in Equation (1). We consider the best estimate to be the one with the highest probability. Equivalently, the best pose is the one that maximizes this distribution (Equation (7)).
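As a reading aid, the three filter steps and the final estimate can be written in one possible form. This is a sketch under assumed notation, not the original equations: $z_r$ and $z_d$ denote the RGB and depth observations, $o$ the object label, $b^{(i)}$ and $q_t^{(i)}$ the bounding box and pose of sample $i$ at iteration $t$, $\mathcal{W}(b)$ the 3D workspace indicated by a box, $w_t^{(i)}$ the sample weight, $\eta$ the normalizing factor, $L(\cdot)$ the likelihood function, $\tilde{q}_t^{(i)}$ a pose drawn by resampling according to the weights, and $\Sigma_t$ the diffusion covariance. The pose update is written additively only for brevity.

```latex
\begin{align*}
b^{(i)} &\sim P(b \mid o, z_r), \quad
q_0^{(i)} \sim \mathcal{U}\big(\mathcal{W}(b^{(i)})\big)
  && \text{initialization, cf. Eq. (4)} \\
w_t^{(i)} &= \eta \, L\big(z_d \mid q_t^{(i)}, b^{(i)}, o\big)
  && \text{weighting, cf. Eq. (5)} \\
q_{t+1}^{(i)} &= \tilde{q}_t^{(i)} + \epsilon_t, \quad
\epsilon_t \sim \mathcal{N}(0, \Sigma_t)
  && \text{diffusion after resampling, cf. Eq. (6)} \\
q^{*} &= \arg\max_{q_t^{(i)}} w_t^{(i)}
  && \text{best estimate, cf. Eq. (7)}
\end{align*}
```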


IV-A Object Detection

The goal of the first stage is to provide a probability distribution map for an object class in a given input image. To achieve this, we exploit the discriminative power of CNNs. Inspired by region proposal networks (RPN) in [19], our PyramidCNN serves as a proposal method for the second stage. We choose VGG-16 networks [32] to extract features, which are directed to two fully convolutional networks (FCN) [33]: a classifier learning the object labels and a shape network learning the bounding box aspect ratios. The structure of PyramidCNN is detailed in Fig. 


The input to our networks is a pyramid of images at different scales. This enables the networks to detect objects with different sizes and appearing at various distances. Thus, the output contains a pyramid of heatmaps representing bounding boxes associated with confidence scores, positions, aspect ratios, and sizes for each object class. Different from end-to-end learning systems, we do not apply any threshold to the confidence scores in order to avoid any false negatives generated by the first stage.

IV-B Pose Estimation

The purpose of the second stage is to estimate the object pose by performing iterated likelihood weighting, which offers us robustness and versatility over the search space. This is critical in our context since the manipulation task heavily depends on the accuracy of pose estimations. We expect the second stage to perform robustly even with inaccurate detection from the first stage.

IV-B1 Initial Samples

We use a set of weighted samples to represent the belief over the object pose, where each 6D sample pose has an associated weight. Given an object class, its pose, and the corresponding geometry model, we can render a 3D point cloud observation r using the z-buffer of a 3D graphics engine. Essentially, these rendered point clouds are what would be observed if the object had the hypothesized poses; we refer to them as rendered samples hereafter. The samples are initialized according to the first stage output. As mentioned in Section IV-A, our CNN produces a density pyramid that is essentially a list of bounding boxes with confidence scores. We perform importance sampling over the confidence scores and initialize our samples uniformly within the 3D workspaces indicated by the sampled bounding boxes, as shown in Equation (4). More samples are spawned within bounding boxes with higher confidence scores.
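The initialization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `init_samples`, the box tuple layout, the pixel-plus-depth pose parameterization, the `depth_range` workspace bound, and the per-axis uniform orientation are all assumptions made for the example.

```python
import math
import random


def init_samples(boxes, n_samples, depth_range=(0.3, 1.5), seed=0):
    """Draw initial pose hypotheses from first-stage detections.

    boxes: list of (x_min, y_min, x_max, y_max, confidence) in pixels.
    depth_range: assumed workspace depth interval in metres (hypothetical).
    """
    rng = random.Random(seed)
    scores = [box[4] for box in boxes]
    # Importance sampling over confidence scores: every box with a nonzero
    # score can be drawn, so no detection is removed by a hard threshold.
    chosen = rng.choices(range(len(boxes)), weights=scores, k=n_samples)
    samples = []
    for idx in chosen:
        x0, y0, x1, y1, _ = boxes[idx]
        samples.append({
            "box": idx,                              # bounding box of this sample
            "u": rng.uniform(x0, x1),                # pixel column inside the box
            "v": rng.uniform(y0, y1),                # pixel row inside the box
            "depth": rng.uniform(*depth_range),      # assumed depth interval
            "rpy": tuple(rng.uniform(-math.pi, math.pi) for _ in range(3)),
        })
    return samples


if __name__ == "__main__":
    detections = [(0, 0, 10, 10, 0.9), (20, 20, 30, 30, 0.1)]
    hypotheses = init_samples(detections, 625)
    print(sum(h["box"] == 0 for h in hypotheses), "of 625 samples in the first box")
```

Because the low-confidence box still receives some samples, a correct but weakly scored detection remains available to the second stage.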

IV-B2 Likelihood Function

The weight of each sample is calculated by the likelihood function, which evaluates the compatibility of a sample with the observations as shown in Equation (5). The likelihood function consists of several parts, including the bounding box weight, the raw pixel-wise inlier ratio, and feature-based inlier ratios. We first define the raw pixel-wise inlier function as:

$$\mathrm{Inlier}(p, p') = \begin{cases} 1, & \text{if } \lVert p - p' \rVert < \epsilon \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

where $p$ and $p'$ refer to a point in the observation point cloud z and a point in the rendered point cloud from a sample pose, respectively, and $\epsilon$ is the point distance threshold. A rendered point is considered an inlier if it is within a certain sensor resolution range of an observed point. The point-wise inlier ratio of a rendered sample is then defined as:

$$I(r, z) = \frac{1}{|r|} \sum_{(i, j)} \mathrm{Inlier}\big(z(i, j), r(i, j)\big) \tag{9}$$

where $(i, j)$ refers to 2D image indices in the rendered sample point cloud r and the observation point cloud z, and $|r|$ refers to the point cloud size.
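The inlier test and inlier ratio described above can be illustrated with a short sketch. The dictionary representation of an organized point cloud (pixel index to 3D point), the function names, and the default threshold `eps` are assumptions for the example; normalizing by the number of rendered points is likewise an assumed reading of "point cloud size".

```python
import math


def is_inlier(p_obs, p_ren, eps=0.005):
    """Return 1 if the rendered point lies within eps metres of the observed point."""
    return 1 if math.dist(p_obs, p_ren) < eps else 0


def inlier_ratio(rendered, observed, eps=0.005):
    """Fraction of rendered points that agree with the observation.

    rendered, observed: dicts mapping a pixel index (i, j) to an (x, y, z) point.
    Rendered pixels with no observed point at the same index count as outliers.
    """
    if not rendered:
        return 0.0
    hits = sum(
        is_inlier(observed[ij], point, eps)
        for ij, point in rendered.items()
        if ij in observed
    )
    return hits / len(rendered)


if __name__ == "__main__":
    rendered = {(0, 0): (0.0, 0.0, 1.0), (0, 1): (0.0, 0.0, 1.0)}
    observed = {(0, 0): (0.0, 0.0, 1.001), (0, 1): (0.0, 0.0, 1.2)}
    print(inlier_ratio(rendered, observed))
```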

Besides raw point-wise inliers, we extract geometry feature point clouds from both rendered samples and observation point clouds and compute feature inlier ratios. In this way, we enhance the robustness of the likelihood function by considering contextual geometric information from 3D point clouds. This term prunes wrong poses that agree with the observation only in individual points but neglect higher-level geometric information such as depth discontinuities and sharp object surfaces. We apply the feature point extraction introduced by Zhang et al. [34] based on local surface smoothness,

$$c(i, j) = \frac{1}{|S| \cdot \lVert p(i, j) \rVert} \Big\lVert \sum_{(k, l) \in S} \big( p(k, l) - p(i, j) \big) \Big\rVert \tag{10}$$

where the smoothness value $c(i, j)$ is calculated by adding all displacement vectors from $p(i, j)$ to each of its neighbor points $p(k, l)$, with $(k, l)$ in the neighborhood $S$. The point cloud p here can be either a rendered sample r or the observation z. The value is then normalized by the size of $S$ and the length of the vector $p(i, j)$. Intuitively, $c$ describes the depth changing rate within a certain local range, which has larger values in areas with acute depth changes and smaller values where object surfaces are consistent. We extract two features, edge points and planar points, by selecting the point sets with the largest and smallest $c$ values respectively. To balance feature point density in areas with different observation quality, we set a maximum number of edge points and planar points to be extracted from a certain local area. Essentially, a point at $(i, j)$ can be selected as an edge or a planar point only if its $c$ value is larger or smaller than a threshold and if the number of selected points has not exceeded the limit. We find that the algorithm is insensitive to our feature extraction parameters. Finally, we apply feature extraction to both the rendered sample and the scene observation point cloud to get sample features and observation features. We use the same inlier calculation in Equations (8) and (9) to calculate feature inlier ratios.
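The smoothness-based feature selection can be sketched on a small organized point cloud. This is an illustrative sketch only: the square pixel neighborhood (`radius`), the function names, and the threshold value used in the example are assumptions; the per-window limits of 5 edge points and 2 planar points follow the implementation details given later in the text.

```python
import math


def smoothness(cloud, i, j, radius=1):
    """Local smoothness at pixel (i, j) of an organized point cloud.

    cloud: dict mapping (i, j) to an (x, y, z) point. The neighbourhood is
    the set of valid pixels in a (2 * radius + 1)^2 window (an assumption).
    """
    p = cloud[(i, j)]
    neighbors = [cloud[(a, b)]
                 for a in range(i - radius, i + radius + 1)
                 for b in range(j - radius, j + radius + 1)
                 if (a, b) != (i, j) and (a, b) in cloud]
    norm_p = math.hypot(*p)
    if not neighbors or norm_p == 0.0:
        return 0.0
    # Sum the displacement vectors from p to each neighbour, then normalise
    # by the neighbourhood size and by the length of p.
    disp = [sum(q[k] - p[k] for q in neighbors) for k in range(3)]
    return math.hypot(*disp) / (len(neighbors) * norm_p)


def select_features(cloud, window, threshold, max_edge=5, max_planar=2):
    """Pick edge points (largest values) and planar points (smallest values)."""
    scored = sorted((smoothness(cloud, i, j), (i, j)) for (i, j) in window)
    edges = [ij for c, ij in reversed(scored) if c > threshold][:max_edge]
    planars = [ij for c, ij in scored if c <= threshold][:max_planar]
    return edges, planars


if __name__ == "__main__":
    # A flat surface at 1.0 m with a depth step to 1.5 m at column 3.
    cloud = {(i, j): (0.01 * j, 0.01 * i, 1.0 if j < 3 else 1.5)
             for i in range(5) for j in range(5)}
    window = [(i, j) for i in range(1, 4) for j in range(1, 4)]
    print(select_features(cloud, window, threshold=0.05))
```

On the flat part of the surface the displacement vectors cancel, so the value is close to zero; next to the depth step they do not, which is what marks those pixels as edge candidates.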

The weight of each hypothesis is defined as

$$w = \lambda_1 c_b + \lambda_2 I_b + \lambda_3 I_r + \lambda_4 I_e + \lambda_5 I_p \tag{11}$$

where $c_b$ is the confidence score of the bounding box, $I_r$ is the ratio of pixel-wise inliers in the whole rendered sample point cloud, and $I_b$ is the inlier ratio in the portion of the rendered sample that is within the bounding box ($I_b$ is 0 if no rendered sample point falls into the bounding box). $I_e$ and $I_p$ are the inlier ratios in sample edges and sample planars with respect to observation features. The coefficients $\lambda_1, \dots, \lambda_5$ represent the importance of each likelihood term and sum up to 1. Notably, the first two terms, $c_b$ and $I_b$, are heavily determined by the bounding boxes from the first stage PyramidCNN and describe the consistency between the pose sample and the first stage detection. We refer to them as network terms. The last three terms weigh how well the current hypothesis is explained by the scene geometry. Therefore, we refer to them as geometric terms.
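A weighted combination of the five likelihood terms can be sketched as below. The function name and the specific coefficient values are hypothetical; they are chosen only to respect the constraints stated in the text (coefficients sum to 1, 20% on the two network terms and 80% on the three geometric terms, split roughly evenly within each group).

```python
def sample_weight(box_conf, in_box_ratio, full_ratio, edge_ratio, planar_ratio,
                  coeffs=(0.10, 0.10, 0.30, 0.25, 0.25)):
    """Combine network terms and geometric terms into one sample weight.

    coeffs: hypothetical importance coefficients, ordered as
    (box confidence, in-box inlier ratio, full inlier ratio,
     edge inlier ratio, planar inlier ratio). They must sum to 1.
    """
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    terms = (box_conf, in_box_ratio, full_ratio, edge_ratio, planar_ratio)
    return sum(c * t for c, t in zip(coeffs, terms))


if __name__ == "__main__":
    # A sample with a weak detection but strong geometric agreement still
    # receives a high weight, because geometric terms carry most of the mass.
    print(sample_weight(0.1, 0.9, 0.9, 0.8, 0.85))
```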

IV-B3 Update Process

To produce object pose estimations, we follow the procedure of iterated likelihood weighting by first assigning a new weight to each sample. Resampling is done with replacement according to the sample weights. During the diffusion process shown in Equation (6), each pose is diffused in space subject to zero-mean Gaussian noise with time-varying variances for translation and rotation respectively. The standard deviations at each iteration are decayed according to the weight of the best pose estimate at that iteration. Bounding boxes are not diffused. The algorithm terminates when the weight of the best pose estimate reaches a threshold, or when the iteration limit is reached. Finally, we assume the pose weights of objects in the scene will be much higher than those for non-existing objects.
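The weight, resample, diffuse loop described above can be sketched generically. This is not the authors' code: poses are plain tuples of floats, the likelihood is any function returning a value in [0, 1], the linear decay `sigma0 * (1 - w_star)` is a stand-in for the decay rule, and the function name and default parameters are assumptions.

```python
import math
import random


def iterated_likelihood_weighting(poses, likelihood, sigma0=0.05,
                                  max_iters=400, stop_weight=0.9, seed=0):
    """Run a weight / resample / diffuse loop and return (best_pose, best_weight)."""
    rng = random.Random(seed)
    best_pose, best_w = None, float("-inf")
    for _ in range(max_iters):
        weights = [likelihood(p) for p in poses]
        w_star = max(weights)
        if w_star > best_w:
            best_pose, best_w = poses[weights.index(w_star)], w_star
        # Terminate once the best weight reaches the threshold.
        if best_w >= stop_weight:
            break
        # Resample with replacement according to the sample weights.
        probs = weights if sum(weights) > 0.0 else [1.0] * len(poses)
        resampled = rng.choices(poses, weights=probs, k=len(poses))
        # Diffuse with zero-mean Gaussian noise whose scale shrinks as the
        # best weight grows (assumed linear decay, for illustration only).
        sigma = sigma0 * (1.0 - w_star)
        poses = [tuple(x + rng.gauss(0.0, sigma) for x in p) for p in resampled]
    return best_pose, best_w


if __name__ == "__main__":
    target = (0.33, -0.27)

    def toy_likelihood(p):
        return math.exp(-200.0 * ((p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2))

    grid = [(x / 10.0, y / 10.0) for x in range(-10, 11) for y in range(-10, 11)]
    print(iterated_likelihood_weighting(grid, toy_likelihood))
```

The toy example uses a 2D pose and a smooth synthetic likelihood; in the setting of the text the likelihood would instead compare rendered samples against the depth observation.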

V Experiments

V-A Implementation

We use PyTorch for our CNN implementation based on a VGG16 model pre-trained on ImageNet [35]. The shape network branch of our CNN predicts 7 different aspect ratios. The size of a training image is 224×224 and contains a single object. The aspect ratio of an object in the training image can be inferred from the width and height of the object. We apply a softmax at the end of the network to generate probability distribution of object classification and aspect ratio. We use cross entropy as the loss function in training.
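To make the softmax and cross-entropy step concrete, here is a small standard-library sketch for a single output vector, for example the 7 aspect-ratio classes of the shape branch. It only illustrates the two formulas; the function names are ours and it is not the PyTorch training code.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


if __name__ == "__main__":
    aspect_logits = [0.2, 1.5, -0.3, 0.0, 0.7, -1.0, 0.1]  # 7 aspect-ratio classes
    print(softmax(aspect_logits))
    print(cross_entropy(aspect_logits, target_index=1))
```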

Our second stage pose estimation relies on the OpenGL graphics engine to render depth images with 3D geometry models and hypothesis poses on Nvidia GTX1080/RTX2070 graphics cards. During the iterated likelihood weighting process, we allocate 625 samples for each iteration and run the algorithm for 400 iterations in total, with the point distance threshold set to 0.005 m. The sample size is limited by the buffer size of our rendering engine, while the iteration limit was set since our pose estimation converges after approximately 300 iterations. The point distance threshold was set to the approximate distance between adjacent points in 3D point clouds.

In the feature extraction mentioned in Section IV-B2, we extract up to 5 edge points and 2 planar points from each 5×5 pixel non-overlapping sliding window. Given a 3D point cloud p, we consider a point an edge point if its smoothness value (see Equation (10)) exceeds a threshold, or a planar point otherwise. These hyper-parameters are determined experimentally for clear indication of object boundaries as well as surfaces. The likelihood coefficients are set according to the following considerations. Through experiments, we find that the system performance is sensitive to the total category weight allocated to network terms and geometric terms, rather than to the allocation within each category. If the first stage produces accurate detection as evaluated by mean average precision (mAP), one can take advantage of it by allocating more weight to network terms. Otherwise, one should reduce the weight of network terms to attenuate the negative impact of an underperforming first stage. Since our first stage produces low-mAP detection, which will be shown in Section V-C1, we allocate only 20% of the weight to network terms. We allocate the remaining 80% to geometric terms since these terms are robust to adversarial scenarios and unreliable first stage detection. Further weight allocation within each category is done approximately evenly.

During diffusion, the standard deviations of the Gaussian noises are decayed by a common factor, which drops exponentially from 1.0 to 0.0 as the weight of the best pose estimate increases from 0.6 to 1.0. The expression for the decayed standard deviations at each iteration uses an indicator function. Separate initial standard deviations are used for translation and rotation. The threshold for the convergence condition is set to 0.9.
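One way to realize a decay factor with the stated endpoints is sketched below. The exact expression is not recoverable from the text, so the normalized exponential used here, the `rate` parameter, and the function name are assumptions; only the behavior at the endpoints (1.0 up to a best weight of 0.6, 0.0 at a best weight of 1.0) is taken from the description.

```python
import math


def decay_factor(best_weight, start=0.6, end=1.0, rate=5.0):
    """Multiplier for the diffusion standard deviations.

    Returns 1.0 while best_weight <= start, then falls exponentially to
    exactly 0.0 at best_weight == end. `rate` (assumed) sets the steepness.
    """
    if best_weight <= start:
        return 1.0
    if best_weight >= end:
        return 0.0
    t = (best_weight - start) / (end - start)
    floor = math.exp(-rate)
    # Shift and rescale exp(-rate * t) so it spans [0, 1] over t in [0, 1].
    return (math.exp(-rate * t) - floor) / (1.0 - floor)


if __name__ == "__main__":
    for w in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
        print(w, round(decay_factor(w), 4))
```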

V-B Dataset and Baselines

We use the YCB video dataset [5] as the training data for our first stage PyramidCNN. The YCB video dataset consists of 133,827 frames of 21 objects under normal conditions with balanced and adequate lighting but no occlusion. To compare the performance of our two-stage method against the baseline methods, PoseCNN [5] and DOPE [6], we collect a testing dataset (i.e., the adversarial YCB dataset) from 40 scenes with 15 out of the 21 objects from the YCB video dataset under adversarial scenarios. In each scene, we place 5-7 different objects on a table and collect seven frames: one in normal lighting, one in darkness, two with different single light sources, and three with different cluttered object placements (see Fig. 3). The dark setting and the two single-lighting settings shift image pixel values away from those in the training set and thus undermine network prediction. We refer to these settings as varied lighting for simplicity. In addition, object clutter causes occlusions as well as natural information loss and challenges the robustness of pose estimation algorithms. All the scene images and 3D point clouds are gathered by the RGB-D sensor on our Fetch robot. Ground truth bounding boxes and 6D poses are manually labeled.

Fig. 3: Testing dataset with YCB objects under adversarial settings. The base-setting data is collected with regular lighting without occlusions. The dark-setting data is collected with all lights off in the room. The single-lighting data is collected with a flashlight. Object poses are the same in the previous three settings. Data in the three occlusion scenes is collected with the same objects randomly stacked.

V-C Evaluation

V-C1 Improved mAP by second stage

The mean average precision (mAP) of our method for first stage PyramidCNN detection and for the final pose estimation results is listed in Table I. We use sparse data to train the PyramidCNN in order to test the robustness of the second stage inference. Further, the varied lighting and cluttered occlusion in testing yielded a low mAP score for the PyramidCNN output in comparison to unaltered environments. After the second stage of generative sampling, the mAPs are improved beyond 0.5. Thus, our second stage has successfully improved pose estimation performance under adversarial scenarios.

mAP          Base     Varied Lighting   Occlusion
PyramidCNN   0.2824   0.1401            0.1711
GRIP         0.6739   0.5475            0.5069

TABLE I: Mean average precision (mAP) of first stage PyramidCNN and GRIP.

Fig. 4: The comparison between DOPE, PoseCNN+ICP and our GRIP two-stage method on pose estimation accuracy of 4 objects mentioned in Sec. V-C: (a) base settings, (b) varied lighting settings, (c) occlusion settings.
Fig. 5: Overall pose estimation accuracy of 15 YCB objects using PoseCNN and our GRIP method.
Area Under Accuracy-Threshold Curve

                        Base                         Varied Lighting              Occlusions
Object                  DOPE    PoseCNN+ICP  GRIP    DOPE    PoseCNN+ICP  GRIP    DOPE    PoseCNN+ICP  GRIP
003_cracker_box         0.6384  0.5925       0.7878  0.5509  0.6225       0.7923  0.5703  0.4850       0.7442
005_tomato_soup_can     0.7691  0.5535       0.9015  0.5326  0.5181       0.8104  0.6372  0.6013       0.8347
006_mustard_bottle      0.5720  0.4280       0.8860  0.6290  0.6310       0.8552  0.7295  0.5864       0.8208
007_tuna_fish_can       -       0.3763       0.7670  -       0.3616       0.7849  -       0.3915       0.6220
010_potted_meat_can     0.5556  0.6756       0.8226  0.4347  0.5273       0.5006  0.5045  0.5962       0.6342
011_banana              -       0.3922       0.5467  -       0.3449       0.5591  -       0.2137       0.2750
019_pitcher_base        -       0.2442       0.1774  -       0.2490       0.1659  -       0.3341       0.1874
021_bleach_cleanser     -       0.3302       0.5671  -       0.3111       0.5523  -       0.2204       0.4635
024_bowl                -       0.3190       0.8674  -       0.3397       0.7109  -       0.2345       0.6185
025_mug                 -       0.3491       0.2201  -       0.3176       0.2170  -       0.2216       0.2094
037_scissors            -       0.5450       0.3192  -       0.5812       0.1647  -       0.3548       0.1783
040_large_marker        -       0.2071       0.7537  -       0.4094       0.6750  -       0.2736       0.6711
051_large_clamp         -       0.2405       0.4927  -       0.2061       0.1642  -       0.0645       0.2551
052_extra_large_clamp   -       0.4000       0.1742  -       0.1460       0.1742  -       0.2147       0.1441
061_foam_brick          -       0.8094       0.8333  -       0.6380       0.8011  -       0.5419       0.8297
Overall                 -       0.4308       0.6078  -       0.4136       0.5285  -       0.3556       0.4992

TABLE II: Overall Performance (Area Under accuracy-threshold Curve) of 15 YCB Objects on DOPE, PoseCNN with ICP and our GRIP method. Symmetric objects are marked with stars and evaluated using ADD-S; asymmetric objects are evaluated using ADD. A dash marks entries for which no DOPE result is reported.

V-C2 Comparing accuracy with PoseCNN and DOPE with 4 YCB objects

We compare our pose estimation accuracy with PoseCNN (with ICP refinement) and DOPE on the adversarial YCB dataset. We use pre-trained models from the authors’ GitHub pages for PoseCNN and DOPE and train our first stage PyramidCNN using 2500 frames from the original YCB video dataset. Since DOPE is trained with 5 of the 21 objects from the YCB Video Dataset, we first compare all three methods on 4 of them: 003_cracker_box, 005_tomato_soup_can, 006_mustard_bottle and 010_potted_meat_can. The fifth object, 004_sugar_box, was unavailable from the market when this experiment was set up. We use the ADD and ADD-S metrics [5] to calculate pose error for asymmetric and symmetric objects respectively (symmetric objects are marked with asterisks in Table II). In manipulation tasks, the tolerable pose estimation error is bounded by the clearance that objects have when placed in the robot end effector. Based on the sizes of the Fetch robot gripper and the target objects, we choose 0.04 m as the maximum error tolerance. Therefore, we plot accuracy-threshold curves within a range of [0.00 m, 0.04 m] in Fig. 4 and calculate the AUC (Area Under accuracy-threshold Curve) as the evaluation metric.
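The evaluation quantities named above can be sketched in a few lines. ADD averages the distance between corresponding model points under the estimated and ground-truth poses, and ADD-S averages the distance to the closest point instead, following the definitions in [5]. The function names, the plain-list matrix representation, and the normalization of the AUC to [0, 1] by the maximum threshold are assumptions for this illustration.

```python
import math


def transform(points, R, t):
    """Apply rotation matrix R (3x3 nested lists) and translation t to each point."""
    return [tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
            for p in points]


def add_error(model, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    est, gt = transform(model, R_est, t_est), transform(model, R_gt, t_gt)
    return sum(math.dist(a, b) for a, b in zip(est, gt)) / len(model)


def add_s_error(model, R_est, t_est, R_gt, t_gt):
    """ADD-S: mean distance from each estimated point to its closest ground-truth point."""
    est, gt = transform(model, R_est, t_est), transform(model, R_gt, t_gt)
    return sum(min(math.dist(a, b) for b in gt) for a in est) / len(model)


def auc(errors, max_threshold=0.04, steps=400):
    """Area under the accuracy-threshold curve on [0, max_threshold], scaled to [0, 1]."""
    if not errors:
        return 0.0
    area = 0.0
    for k in range(steps):
        threshold = (k + 0.5) * max_threshold / steps  # midpoint rule
        area += sum(e < threshold for e in errors) / len(errors)
    return area / steps


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    model = [(0.05, 0.0, 0.0), (0.0, 0.05, 0.0), (0.0, 0.0, 0.05)]
    errors = [add_error(model, identity, (dx, 0.0, 0.0), identity, (0.0, 0.0, 0.0))
              for dx in (0.005, 0.01, 0.03, 0.06)]
    print(auc(errors))
```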

GRIP outperforms the other two methods under most error thresholds, especially lower ones, and thereby facilitates robotic manipulation tasks.

V-C3 Comparing accuracy with PoseCNN with 15 YCB objects

Next, we perform an extensive comparison of our method with PoseCNN (with ICP) on 15 of the 21 YCB objects. Table II and Fig. 5 show our overall results and detailed accuracy evaluations for each object.

GRIP outperforms PoseCNN+ICP for most objects under all three settings. All methods perform worse under varied lighting and occlusions than under the base setting. We can infer the strengths and weaknesses of each method from its performance variance among different objects. For example, PoseCNN with ICP performs better on symmetric objects such as 003_cracker_box and 061_foam_brick than on others such as 021_bleach_cleanser. Symmetric objects contain repetitive features which are more likely to be captured by learning-based systems. GRIP performs better on objects that are well recognizable by a depth camera. Large and compact objects such as 006_mustard_bottle and 024_bowl naturally generate dense and continuous 3D point cloud observations that effectively capture their geometry. Objects with thin or articulated parts, such as 037_scissors, 052_extra_large_clamp, and 025_mug, produce sparse point clouds around their handle-like parts that do not effectively reveal the scene geometry, especially object orientations. Hence, our GRIP algorithm best suits scenarios where rich depth sensory data are available due to detectable object dimensions and surface materials or high-definition depth sensors. Finally, distinguishing near-identical objects remains challenging. For instance, 051_large_clamp and 052_extra_large_clamp have identical colors and shapes and differ only slightly in size. This results in poor estimation accuracy by all methods.

VI Conclusions

We have introduced GRIP as a two-stage method for robust 6D object pose estimation suited to adversarial settings. GRIP demonstrated similar and improved performance with respect to state-of-the-art neural network pose estimators on the adversarial YCB dataset. The key insight of GRIP is to avoid hard thresholding, which introduces false positives and false negatives, until a final pose estimate is required. Avoiding hard thresholds increases the possibility of finding the real pose in adversarial environments. In addition, a generative second stage inherently provides an avenue for explainable perception, without requiring deciphering network weights. Also, this generative process readily extends to tracking over multiple instances of time through the inclusion of a proper process model. The results presented are also amenable to improvement, given the limited types of features considered. These benefits come at the cost of assuming only one instance of each object is present in the scene. For future work, we aim to investigate these limitations by exploring features amenable to robust inference with multiple object instances in greater clutter.


  • [1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    , pages 580–587, 2014.
  • [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [3] Marcus Gualtieri, Andreas ten Pas, and Robert Platt. Pick and place without geometric object models. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7440. IEEE, 2018.
  • [4] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [5] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
  • [6] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
  • [7] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. arXiv preprint arXiv:1901.04780, 2019.
  • [8] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 1:1, 2017.
  • [9] Zhiqiang Sui, Zheming Zhou, Zhen Zeng, and Odest Chadwicke Jenkins.

    Sum: Sequential scene understanding and manipulation.

    In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 3281–3288. IEEE, 2017.
  • [10] Yanqi Liu, Alessandro Costantini, R Bahar, Zhiqiang Sui, Zhefan Ye, Shiyang Lu, and Odest Chadwicke Jenkins. Robust object estimation using generative-discriminative inference for secure robotics applications. In Proceedings of the International Conference on Computer-Aided Design, page 75. ACM, 2018.
  • [11] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • [12] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in belief space. The International Journal of Robotics Research, 32(9-10):1194–1227, 2013.
  • [13] Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208, 1971.
  • [14] Shiwali Mohan, Aaron H Mininger, James R Kirk, and John E Laird. Acquiring grounded representations of words with situated interactive instruction. In Advances in Cognitive Systems. Citeseer, 2012.
  • [15] Venkatraman Narayanan and Maxim Likhachev. Discriminatively-guided deliberative perception for pose estimation of multiple 3d object instances. In Robotics: Science and Systems, 2016.
  • [16] Venkatraman Narayanan and Maxim Likhachev. Perch: Perception via search for multi-object recognition and localization. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5052–5059. IEEE, 2016.
  • [17] Ziyuan Liu, Dong Chen, Kai M Wurm, and Georg von Wichert. Table-top scene analysis using knowledge-supervised mcmc. Robotics and Computer-Integrated Manufacturing, 33:110–123, 2015.
  • [18] Dominik Joho, Gian Diego Tipaldi, Nikolas Engelhard, Cyrill Stachniss, and Wolfram Burgard. Nonparametric bayesian models for unsupervised scene analysis and reconstruction. Robotics, page 161, 2013.
  • [19] Alvaro Collet, Manuel Martinez, and Siddhartha S Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 30(10):1284–1306, 2011.
  • [20] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [21] Matei Ciocarlie, Kaijen Hsiao, Edward Gil Jones, Sachin Chitta, Radu Bogdan Rusu, and Ioan A Şucan. Towards reliable grasping and manipulation in household environments. In Experimental Robotics, pages 241–252. Springer, 2014.
  • [22] Chavdar Papazov, Sami Haddadin, Sven Parusel, Kai Krieger, and Darius Burschka. Rigid 3d geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research, 31(4):538–553, 2012.
  • [23] Andreas Ten Pas and Robert Platt. Localizing handle-like grasp affordances in 3d point clouds. In Experimental Robotics, pages 623–638. Springer, 2016.
  • [24] Andreas ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, 36(13-14):1455–1473, 2017.
  • [25] Zhiqiang Sui, Odest Chadwicke Jenkins, and Karthik Desingh. Axiomatic particle filtering for goal-directed robotic manipulation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4429–4436. IEEE, 2015.
  • [26] Karthik Desingh, Odest Chadwicke Jenkins, Lionel Reveret, and Zhiqiang Sui. Physically plausible scene estimation for manipulation in clutter. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pages 1073–1080. IEEE, 2016.
  • [27] Zhen Zeng, Zheming Zhou, Zhiqiang Sui, and Odest Chadwicke Jenkins. Semantic robot programming for goal-directed manipulation in cluttered scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7462–7469. IEEE, 2018.
  • [28] Chaitanya Mitash, Abdeslam Boularias, and Kostas Bekris. Robust 6d object pose estimation with stochastic congruent sets. arXiv preprint arXiv:1805.06324, 2018.
  • [29] Nicolas Mellado, Dror Aiger, and Niloy J Mitra. Super 4pcs fast global pointcloud registration via smart indexing. In Computer Graphics Forum, volume 33, pages 205–215. Wiley Online Library, 2014.
  • [30] W Keith Hastings. Monte carlo sampling methods using markov chains and their applications. 1970.
  • [31] Stephen J. Mckenna and Hammadi Nait-Charif. Tracking human motion using auxiliary particle filters and iterated likelihood weighting. Image Vision Comput., 25:852–862, 2007.
  • [32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [34] Ji Zhang and Sanjiv Singh. Loam: Lidar odometry and mapping in real-time. In Robotics: Science and Systems, volume 2, page 9, 2014.
  • [35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.