I. Introduction
Taking advantage of the renaissance in deep neural networks, machine learning has achieved great progress in object detection and segmentation [1] and image recognition [2]. These deep learning methods are also prevalent in robotics for problems including manipulation in clutter [3] and learning of manipulation actions [4]. For 6D object pose estimation, learning-based Convolutional Neural Networks (CNNs) have achieved promising accuracy and real-time inference speed [5, 6, 7]. Notably, these successes rely on well-designed models and adequate training resources. The robustness and generalization capability of CNNs heavily depend on the training data, which represents a certain range of conditions that could be faced by robots. However, due to the complex and dynamic nature of the real world, robots are subject to unseen environmental conditions that are not present in the training data. More specifically, CNN recognition systems are vulnerable to errors (both benign and malicious) due to the effects of overfitting during the training process. Distorted objects and/or objects captured under poor lighting conditions can be enough to defeat the recognition abilities of a CNN [8]. Such perception errors can lead to (potentially disastrous) outcomes for embodied systems acting in the real world. These challenges for robust perception become that much more difficult when an adversary can modify the environment to exploit the vulnerabilities of a CNN. For instance, in the context of object recognition for a robotic system, a malicious attack (through simple modifications of an environment) has the potential to drastically alter and even manipulate a robot's final behavior. Fig. 1 shows such a robot manipulation task in a dark scene.
Generative-discriminative algorithms [9, 10] offer a promising avenue for robust perception. Such methods combine inference by deep learning (or other discriminative techniques) with sampling and probabilistic inference models to achieve robust and adaptive perception in adversarial environments. The value proposition of generative-discriminative inference is to get the best out of existing approaches to computational perception and robotic manipulation while avoiding their shortcomings: the robustness of belief space planning [11, 12] without its computational intractability, the recall power of neural networks without excessive overfitting [4], and the efficiency of deterministic inference without its fragility to uncertainty [13, 14]. Generative-discriminative algorithms will be especially advantageous when exposed to adversarial attack, building on foundational ideas in this space [15, 16, 17, 18, 19]. Furthermore, we expect our approach to be more generally applicable to guard against broad categories of attack, with a clear pathway to explainability of the resulting perceptual estimates.
In this paper, we present Generative Robust Inference and Perception (GRIP) as a two-stage method that explores generative-discriminative inference for object recognition and pose estimation in adversarial environments. Within GRIP, the first stage of inference is a CNN-based recognition distribution. This recognition distribution is used within a second stage of generative multi-hypothesis optimization, implemented as a particle filter with a static state process. We show that GRIP produces comparable or improved performance with respect to state-of-the-art pose estimation systems (PoseCNN [5] and DOPE [6]) under adversarial scenarios with varied lighting and cluttered occlusion. Moreover, we demonstrate the compatibility of GRIP with goal-directed sequential manipulation in object pick-and-place tasks on a Michigan Progress Fetch robot.
II. Background
II-A Motivation
To get the best of both worlds, we consider the relative strengths and weaknesses of deep learning and generative inference for robust perception. We are particularly interested in the complementary properties of these methods for making perceptual decisions, where the weaknesses of one can be addressed by the strengths of the other. Despite their strengths, CNNs have several shortcomings that leave them vulnerable to adversarial action: opacity in how their decisions are made, fragility when generalizing beyond overfit training examples, and inflexibility in recovering from false decisions. Goodfellow et al. [20] demonstrated that adversarial examples are misclassified by models with different architectures and by models trained on different subsets of the training data. These weaknesses of CNNs play to the strengths of generative probabilistic inference, which is inherently explainable, general, and resilient through the process of generating, evaluating, and maintaining a distribution of many hypotheses representing possible decisions. However, this robustness comes at the cost of computational efficiency: probabilistic inference, in contrast to CNNs, is often computationally intractable, with complexity that grows exponentially in the number of variables. GRIP aims to overcome these limitations by combining the strengths of deep learning and probabilistic inference through a two-stage algorithm, illustrated in Fig. 2 and discussed in Section IV. The remainder of this background section provides a broader overview of related work.
II-B Perception for Manipulation
Perception is a critical step for robotic manipulation in unstructured environments. Ciocarlie et al. [21] proposed an architecture for reliable grasping and manipulation, where non-touching, isolated objects are estimated by clustering surface normals of RGB-D sensor data. The MOPED framework [19] performs object detection and pose estimation using iterative clustering estimation from multi-view features. A bottom-up approach is taken in [22] using RANSAC and Iterative Closest Point (ICP) registration, relying solely on geometric information. Narayanan et al. [15] integrated global search with discriminatively trained algorithms to balance robustness and efficiency for multi-object identification, assuming known objects.
For manipulation in dense clutter, ten Pas and Platt [23] showed success in detecting grasp affordances from 3D point clouds. In [24], they sample grasp pose candidates based on geometric plausibility, from which feasible grasp poses are selected by a CNN. For manipulation with known object geometry models, [25, 26, 27] proposed generative sampling approaches to scene estimation of object poses and physical support relations. However, these methods used object detection bounding boxes with hard thresholding as the prior for generative sampling, which can cause false negatives.
II-C Object Detection and Pose Estimation
Learning-based approaches have been used as modules within object pose estimation systems or trained directly as end-to-end systems. Sui et al. [9] proposed a sample-based two-stage framework for sequential manipulation tasks, where object detection results serve as the prior for sample initialization. Mitash et al. [28] developed a two-stage approach that runs stochastic sampling of congruent sets [29] to obtain object poses based on the semantic map from a segmentation network. Among end-to-end systems, PoseCNN [5] is a neural network that learns segmentation, object 3D translation, and 3D rotation separately. This work also contributed a large object dataset, the YCB video dataset, for benchmarking robotic pose estimation and manipulation approaches. DOPE [6] outperformed PoseCNN in estimation accuracy and robustness in dark and occluded scenes by training the network on a synthetic dataset generated with domain randomization and photorealistic simulation. Another recent work, DenseFusion [7], uses two networks to extract RGB and depth features separately; 6D poses are estimated from the combined features and refined by an additional residual network module.
In this paper, we focus on the pose estimation problem in adversarial scenarios. Liu et al. [10] provided insight into handling adversarial clutter, yet offered limited evaluation of their approach and no comparison with state-of-the-art methods. We believe that the performance of CNNs depends heavily on the consistency of the testing environment with the training set, and that the same is true for the two-stage methods in [9] and [28], since they rely on high-quality CNN output from their first stages. Our main contribution is a two-stage pose estimation system that is robust under adversarial scenarios and able to recover from false detections produced by its own first stage.
III. Problem Formulation
Given an RGB-D observation (Z_r, Z_d) from the robot sensor and 3D geometry models of a known object set, our aim is to estimate the conditional joint distribution P(q, b | Z_r, Z_d) for each object class, where q is the six-DoF object pose and b is the object bounding box in the RGB image. The problem can be formulated as:

P(q, b | Z_r, Z_d)                          (1)
  = P(q | b, Z_r, Z_d) P(b | Z_r, Z_d)      (2)
  ≈ P(q | b, Z_d) P(b | Z_r)                (3)

Equations (2) and (3) are derived using the chain rule, and Equation (3) represents the factoring into object detection and pose estimation. Here, we assume that pose estimation is conditionally independent of the RGB observation, while object detection is conditionally independent of the depth observation. Ideally, we could use Markov Chain Monte Carlo (MCMC)
[30] to estimate the distribution in Equation (1). However, the joint state space is so large that the distribution is intractable to compute directly. End-to-end neural network methods can also be used to approximate the distribution [5, 6, 7], but these place a heavy reliance on proper coverage of the input space by the training set. This data reliance makes such methods vulnerable to unforeseen environment changes. SUM [9] implements a factoring similar to Equation (1) to filter over hard detections provided by a CNN, thereby enabling it to filter out false positive CNN detections. The limitation of SUM is its inability to recover from false negatives that are eliminated from consideration in its object proposal and detection stages. Our GRIP paradigm, in contrast, is able to compensate for data deficiency by employing a generative sampling method in its second stage.

IV. Method
We propose a two-stage paradigm that combines object detection and pose estimation, as shown in Fig. 2. In the first stage of inference, the PyramidCNN performs object detection and generates a prior distribution over 2D bounding boxes for each object label. In the second stage, we perform generative multi-hypothesis optimization to estimate the joint distribution for each object label, using the first-stage output as the prior. The second stage is implemented as an iterated likelihood weighting filter [31]:
q_0^{(j)} ~ Uniform(W(b^{(j)})),   b^{(j)} ~ P(b | Z_r)      (4)
w_t^{(j)} = P(Z_d | q_t^{(j)}, b^{(j)}) / η                  (5)
q_{t+1}^{(j)} = q̃_t^{(j)} + ν_t,   ν_t ~ N(0, Σ_t)          (6)

Here W(b) denotes the 3D workspace indicated by bounding box b, and q̃_t^{(j)} is the j-th pose sample after resampling by the weights w_t^{(j)},
where η is the normalizing factor. In Equation (4), each initial pose q_0^{(j)} is generated from a bounding box b^{(j)} sampled from the prior distribution produced by the first stage. After the second stage, we obtain the probability distribution over pose estimates shown in Equation (1). We consider the best estimate to be the one with highest probability; equivalently, the best pose q* satisfies

q* = argmax_q P(q, b | Z_r, Z_d)      (7)
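As an illustration, the loop in Equations (4)-(6) can be sketched on a toy 1D state; the function names and the simple Gaussian diffusion schedule are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def iterated_likelihood_weighting(init_samples, likelihood, diffuse,
                                  n_iters=10, rng=None):
    """Generic iterated-likelihood-weighting loop (sketch): weight each
    sample, resample with replacement by weight, then diffuse, repeating
    for a fixed number of iterations."""
    rng = rng if rng is not None else np.random.default_rng(0)
    samples = np.asarray(init_samples, dtype=float)
    for t in range(n_iters):
        w = np.array([likelihood(q) for q in samples])
        w = w / w.sum()                                        # normalize (eta in Eq. 5)
        idx = rng.choice(len(samples), size=len(samples), p=w)  # resample
        samples = np.array([diffuse(samples[i], t) for i in idx])  # diffuse (Eq. 6)
    w = np.array([likelihood(q) for q in samples])
    return samples[int(np.argmax(w))]                          # best estimate (Eq. 7)
```

On a 1D toy problem with a likelihood peaked at 2.0, the surviving samples concentrate around the peak as the diffusion noise shrinks.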
IV-A Object Detection
The goal of the first stage is to provide a probability distribution map for each object class in a given input image. To achieve this, we exploit the discriminative power of CNNs. Inspired by region proposal networks (RPN) [19], our PyramidCNN serves as a proposal method for the second stage. We choose the VGG16 network [32] to extract features, which are directed to two fully convolutional networks (FCN) [33]: a classifier learning the object labels and a shape network learning the bounding-box aspect ratios. The structure of the PyramidCNN is detailed in Fig. 2. The input to our networks is a pyramid of images at different scales, which enables the networks to detect objects of different sizes appearing at various distances. The output is thus a pyramid of heatmaps representing bounding boxes associated with confidence scores, positions, aspect ratios, and sizes for each object class. Unlike end-to-end learning systems, we do not apply any threshold to the confidence scores, in order to avoid any false negatives generated by the first stage.
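The multi-scale input can be sketched as follows; the nearest-neighbor downsampling is our simplification of whatever resampling the actual pipeline uses:

```python
import numpy as np

def image_pyramid(img, n_levels=4, scale=0.5):
    """Sketch of the multi-scale input: repeatedly downsample the image so
    a fixed-receptive-field FCN can detect objects at different sizes and
    distances. Strided nearest-neighbor sampling keeps the example
    dependency-free."""
    levels = [img]
    step = int(round(1.0 / scale))
    for _ in range(n_levels - 1):
        levels.append(levels[-1][::step, ::step])
    return levels
```

Each level is then fed to the same network, producing one heatmap per scale.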
IV-B Pose Estimation
The purpose of the second stage is to estimate the object pose by performing iterated likelihood weighting, which offers robustness and versatility over the search space. This is critical in our context, since the manipulation task depends heavily on the accuracy of the pose estimates. We expect the second stage to perform robustly even with inaccurate detections from the first stage.
IV-B.1 Initial Samples
We use a set of weighted samples to represent the belief over object pose, where each 6D sample pose corresponds to a weight. Given an object class, its pose, and the corresponding geometry model, we can render a 3D point cloud observation r using the z-buffer of a 3D graphics engine. Essentially, these rendered point clouds are what would be observed if the object had the hypothesized pose; we refer to them as rendered samples hereafter. The samples are initialized according to the first-stage output. As mentioned in Section IV-A, our CNN produces a density pyramid that is essentially a list of bounding boxes with confidence scores. We perform importance sampling over the confidence scores and initialize our samples uniformly within the 3D workspaces indicated by the sampled bounding boxes, as shown in Equation (4). More samples are spawned within bounding boxes with higher confidence scores.
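The initialization step can be sketched as below; for brevity we represent each detection directly as a 3D workspace volume with a confidence score, whereas the paper back-projects 2D bounding boxes using depth:

```python
import numpy as np

def init_pose_samples(boxes, scores, n_samples, rng=None):
    """Sketch of Eq. (4): sample workspaces in proportion to detection
    confidence, then draw a translation uniformly inside each sampled
    volume. Boxes are (xmin, ymin, zmin, xmax, ymax, zmax) volumes here,
    a simplification of the paper's 2D-box-plus-depth workspaces."""
    rng = rng if rng is not None else np.random.default_rng(0)
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()                                   # importance weights
    boxes = np.asarray(boxes, dtype=float)
    idx = rng.choice(len(boxes), size=n_samples, p=p)  # more samples in
    lo, hi = boxes[idx, :3], boxes[idx, 3:]            # higher-confidence boxes
    return rng.uniform(lo, hi)                         # uniform within volume
```

With scores (0.9, 0.1), roughly 90% of the initial samples land inside the first box.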
IV-B.2 Likelihood Function
The weight of each sample is calculated by the likelihood function, which evaluates the compatibility of a sample with the observations, as shown in Equation (5). The likelihood function consists of several parts, including a bounding-box weight, a raw pixel-wise inlier ratio, and feature-based inlier ratios. We first define the raw pixel-wise inlier function as:
I(p_1, p_2) = 1 if ||p_1 − p_2|| < ε, and 0 otherwise      (8)
where p_1 and p_2 refer to a point in the observation point cloud z and a point in the point cloud r rendered from the sample pose, respectively. A rendered point is considered an inlier if it is within a certain sensor-resolution range ε of an observed point. The point-wise inlier ratio of a rendered sample is then defined as:
a = (1/|r|) Σ_{(u,v)} I(r(u,v), z(u,v))      (9)
where (u, v) refers to 2D image indices in the rendered sample point cloud r and the observation point cloud z, and |·| refers to point cloud size.
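A minimal sketch of Equations (8) and (9), assuming both clouds are stored as (H, W, 3) arrays aligned by image index, with NaN marking empty pixels:

```python
import numpy as np

def inlier_ratio(rendered, observed, eps=0.005):
    """Eq. (8)-(9) sketch: a rendered point counts as an inlier when it is
    within eps (roughly the sensor resolution) of the observed point at the
    same image index; the ratio normalizes by the rendered cloud size."""
    d = np.linalg.norm(rendered - observed, axis=-1)   # per-pixel distance
    rendered_px = ~np.isnan(rendered).any(axis=-1)     # pixels with a render
    valid = rendered_px & ~np.isnan(d)                 # pairs with both points
    if rendered_px.sum() == 0:
        return 0.0
    return float((d[valid] < eps).sum() / rendered_px.sum())
```

A sample pose that explains most observed depth pixels scores near 1; a pose rendered over empty or mismatched depth scores near 0.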
Besides raw point-wise inliers, we extract geometric feature point clouds from both rendered samples and observation point clouds and compute feature inlier ratios. In this way, we enhance the robustness of the likelihood function by considering contextual geometric information from the 3D point clouds. This term prunes wrong poses that agree with the observation only at individual points while neglecting higher-level geometric information such as depth discontinuities and sharp object surfaces. We apply the feature point extraction introduced by Zhang et al. [34], based on local surface smoothness,
c = (1 / (|S| · ||p_i||)) · || Σ_{j ∈ S, j ≠ i} (p_i − p_j) ||      (10)

where the smoothness value c is calculated by adding all displacement vectors from p_i to each of its neighbor points p_j in a local neighborhood S. The point cloud p here can be either a rendered sample r or the observation z. The value is normalized by the size of S and the length of the vector p_i. Intuitively, c describes the depth changing rate within a certain local range: it takes larger values in areas with acute depth changes and smaller values where object surfaces are consistent. We extract two kinds of features, edge points and planar points, by selecting the point sets with the largest and smallest c values respectively. To balance feature point density across areas with different observation quality, we set a maximum number of edge points and planar points to be extracted from a given local area. Essentially, a point p_i can be selected as an edge or a planar point only if its c value is larger or smaller than a threshold and the number of selected points has not exceeded the limit. We find the algorithm insensitive to our feature extraction parameters. Finally, we apply feature extraction to both the rendered sample and the scene observation point cloud to obtain sample features and observation features, and use the same inlier calculation as in Equations (8) and (9) to compute feature inlier ratios. The weight of each hypothesis is defined as
w = w_1 c_b + w_2 a_b + w_3 a + w_4 a_e + w_5 a_p      (11)
where c_b is the confidence score of the bounding box, a is the ratio of pixel-wise inliers over the whole rendered sample point cloud, and a_b is the inlier ratio over the portion of the rendered sample that falls within the bounding box (a_b is 0 if no rendered sample point falls into the bounding box). a_e and a_p are the inlier ratios of sample edge and planar features with respect to the observation features. The coefficients w_1, …, w_5 represent the importance of each likelihood term and sum to 1. Notably, the first two terms, c_b and a_b, are heavily determined by the bounding boxes from the first-stage PyramidCNN and describe the consistency between the pose sample and the first-stage detection; we refer to them as network terms. The last three terms weigh how well the current hypothesis explains the scene geometry; we refer to them as geometric terms.
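The smoothness feature of Equation (10) and the combination in Equation (11) can be sketched as follows; the default coefficients are illustrative values matching the 20%/80% split described in Section V-A, not the paper's exact settings:

```python
import numpy as np

def smoothness(points, k=5):
    """LOAM-style smoothness (Eq. 10) along an ordered scan line: sum the
    displacement vectors from each point to its k neighbors on either side,
    then normalize by the neighborhood size |S| = 2k and the point's range.
    Large values indicate edge points; small values indicate planar points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    c = np.zeros(n)
    for i in range(k, n - k):
        nbrs = np.vstack([points[i - k:i], points[i + 1:i + 1 + k]])
        disp = (points[i] - nbrs).sum(axis=0)        # summed displacements
        c[i] = np.linalg.norm(disp) / (2 * k * np.linalg.norm(points[i]))
    return c

def hypothesis_weight(c_b, a_b, a, a_e, a_p, w=(0.1, 0.1, 0.27, 0.27, 0.26)):
    """Convex combination of Eq. (11): network terms (box confidence c_b,
    in-box inlier ratio a_b) plus geometric terms (overall, edge, and planar
    inlier ratios). The default coefficients are illustrative only."""
    return w[0]*c_b + w[1]*a_b + w[2]*a + w[3]*a_e + w[4]*a_p
```

On a straight scan line the smoothness is near zero, while a depth discontinuity produces a large value at the step, which is exactly what the edge selection relies on.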
IV-B.3 Update Process
To produce object pose estimates, we follow the procedure of iterated likelihood weighting by first assigning a new weight to each sample. Resampling is done with replacement according to the sample weights. During the diffusion process shown in Equation (6), each pose is diffused subject to zero-mean Gaussian noise with time-varying variances for translation and rotation respectively. The standard deviations σ_t^p and σ_t^r at iteration t are decayed according to w*, the weight of the best pose estimate at that iteration. Bounding boxes are not diffused. The algorithm terminates when w* reaches a threshold δ or the iteration limit is reached. Finally, we assume the pose weights of objects present in the scene will be much higher than those of non-existing objects.

V. Experiments
V-A Implementation
We use PyTorch for our CNN implementation, based on a VGG16 model pretrained on ImageNet [35]. The shape network branch of our CNN predicts 7 different aspect ratios. Each training image is 224×224 and contains a single object; the aspect ratio of the object can be inferred from its width and height in the image. We apply a softmax at the end of the network to generate probability distributions over object classes and aspect ratios, and use cross-entropy as the loss function in training.
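For reference, the softmax and cross-entropy used at the network heads can be written as:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    """Cross-entropy loss for a single example: the negative log of the
    softmax probability assigned to the true label, as used to train both
    the classifier branch and the aspect-ratio (shape) branch."""
    return float(-np.log(softmax(logits)[label]))
```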
Our second-stage pose estimation relies on the OpenGL graphics engine to render depth images from 3D geometry models and hypothesized poses on Nvidia GTX 1080/RTX 2070 graphics cards. During the iterated likelihood weighting process, we allocate 625 samples per iteration and run the algorithm for 400 iterations in total, with the point distance threshold ε set to 0.005 m, the approximate distance between adjacent points in our 3D point clouds. The sample size is limited by the buffer size of our rendering engine, while the iteration limit was set because our pose estimates converge after approximately 300 iterations.
In the feature extraction described in Section IV-B.2, we extract up to 5 edge points and 2 planar points from each 5×5 pixel non-overlapping sliding window. Given a 3D point cloud p, we consider p_i an edge point if its smoothness value c exceeds a threshold (see Equation (10)), or a planar point otherwise. These hyperparameters were determined experimentally to give clear indication of object boundaries as well as surfaces. The likelihood coefficients are set as follows. Through experiments, we find that system performance is sensitive to the total weight allocated to the network terms versus the geometric terms, rather than to the allocation within each category. If the first stage produces accurate detections, as measured by mean average precision (mAP), one can take advantage of this by allocating more weight to the network terms. Otherwise, one should reduce the weight of the network terms to attenuate the negative impact of an underperforming first stage. Since our first stage produces low-mAP detections, as will be shown in Section V-C.1, we allocate only 20% of the weight to the network terms. We allocate the remaining 80% to the geometric terms, since these terms are robust to adversarial scenarios and unreliable first-stage detection. Further weight allocation within each category is done approximately evenly.
During diffusion, the standard deviations of the Gaussian noises are decayed by a common factor α, which drops exponentially from 1.0 to 0.0 as w* increases from 0.6 to 1.0. In other words, the standard deviations at iteration t are given by σ_t = α(w*_t) σ_0, where

α(w) = 1[w < 0.6] + e^{−λ(w − 0.6)} · 1[w ≥ 0.6]      (12)

and 1[·] is the indicator function. Initial standard deviations σ_0 are set separately for translation and rotation. The threshold δ for the convergence condition is set to 0.9.
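A sketch of the annealing schedule; the decay rate lam is our illustrative choice, since only the endpoints of the decay are specified:

```python
import numpy as np

def decay_factor(w_best, lam=10.0):
    """Sketch of Eq. (12): the noise scale stays at 1.0 while the best
    sample weight is below 0.6, then decays exponentially toward 0.0 as
    the weight approaches 1.0, shrinking the diffusion as the filter
    converges."""
    if w_best < 0.6:
        return 1.0
    return float(np.exp(-lam * (w_best - 0.6)))
```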
V-B Dataset and Baselines
We use the YCB video dataset [5] as the training data for our first-stage PyramidCNN. The YCB video dataset consists of 133,827 frames of 21 objects under normal conditions, with balanced and adequate lighting and no occlusion. To compare our two-stage method against the baseline methods, PoseCNN [5] and DOPE [6], we collect a testing dataset (the adversarial YCB dataset) from 40 scenes with 15 of the 21 objects from the YCB video dataset under adversarial scenarios. In each scene, we place 5–7 different objects on a table and collect seven frames: one in normal lighting, one in darkness, two with different single light sources, and three with different cluttered object placements (see Fig. 3). The dark setting and the two single-lighting settings bias image pixel values away from the training set and thus undermine network prediction; we refer to these settings as varied lighting for simplicity. In addition, object clutter causes occlusion as well as natural information loss, challenging the robustness of pose estimation algorithms. All scene images and 3D point clouds are gathered by the RGB-D sensor on our Fetch robot. Ground truth bounding boxes and 6D poses are manually labeled.
V-C Evaluation
V-C.1 Improved mAP by the Second Stage
The mean average precision (mAP) of our first-stage PyramidCNN detections and of our final pose estimation results is listed in Table I. We use sparse training data for the PyramidCNN in order to test the robustness of the second-stage inference. Further, the varied lighting and cluttered occlusion at test time yield a low mAP for the PyramidCNN output in comparison to unaltered environments. After the second stage of generative sampling, the mAPs improve to beyond 0.5. Thus, our second stage successfully improves pose estimation performance under adversarial scenarios.
mAP          Base     Varied Lighting   Occlusion
PyramidCNN   0.2824   0.1401            0.1711
GRIP         0.6739   0.5475            0.5069
Area Under Accuracy-Threshold Curve

                        Base                      Varied Lighting           Occlusions
Object                  DOPE    PoseCNN GRIP      DOPE    PoseCNN GRIP      DOPE    PoseCNN GRIP
003_cracker_box         0.6384  0.5925  0.7878    0.5509  0.6225  0.7923    0.5703  0.4850  0.7442
005_tomato_soup_can     0.7691  0.5535  0.9015    0.5326  0.5181  0.8104    0.6372  0.6013  0.8347
006_mustard_bottle      0.5720  0.4280  0.8860    0.6290  0.6310  0.8552    0.7295  0.5864  0.8208
007_tuna_fish_can       –       0.3763  0.7670    –       0.3616  0.7849    –       0.3915  0.6220
010_potted_meat_can     0.5556  0.6756  0.8226    0.4347  0.5273  0.5006    0.5045  0.5962  0.6342
011_banana              –       0.3922  0.5467    –       0.3449  0.5591    –       0.2137  0.2750
019_pitcher_base        –       0.2442  0.1774    –       0.2490  0.1659    –       0.3341  0.1874
021_bleach_cleanser     –       0.3302  0.5671    –       0.3111  0.5523    –       0.2204  0.4635
024_bowl                –       0.3190  0.8674    –       0.3397  0.7109    –       0.2345  0.6185
025_mug                 –       0.3491  0.2201    –       0.3176  0.2170    –       0.2216  0.2094
037_scissors            –       0.5450  0.3192    –       0.5812  0.1647    –       0.3548  0.1783
040_large_marker        –       0.2071  0.7537    –       0.4094  0.6750    –       0.2736  0.6711
051_large_clamp         –       0.2405  0.4927    –       0.2061  0.1642    –       0.0645  0.2551
052_extra_large_clamp   –       0.4000  0.1742    –       0.1460  0.1742    –       0.2147  0.1441
061_foam_brick          –       0.8094  0.8333    –       0.6380  0.8011    –       0.5419  0.8297
Overall                 –       0.4308  0.6078    –       0.4136  0.5285    –       0.3556  0.4992
V-C.2 Comparing Accuracy with PoseCNN and DOPE on 4 YCB Objects
We compare our pose estimation accuracy with PoseCNN (with ICP refinement) and DOPE on the adversarial YCB dataset. We use the pretrained models from the authors' GitHub pages for PoseCNN^1 and DOPE^2, and train our first-stage PyramidCNN using 2500 frames from the original YCB video dataset. Since DOPE is trained with 5 of the 21 objects from the YCB video dataset, we first compare all three methods on 4 of them: 003_cracker_box, 005_tomato_soup_can, 006_mustard_bottle, and 010_potted_meat_can. The fifth object, 004_sugar_box, was unavailable from the market when this experiment was set up. We use the ADD and ADD-S metrics [5] to calculate pose error for asymmetric and symmetric objects respectively (symmetric objects are marked with asterisks in Table II). In manipulation tasks, the tolerable pose estimation error is bounded by the clearance objects have when placed in the robot end effector. Based on the sizes of the Fetch robot gripper and the target objects, we choose 0.04 m as the maximum error tolerance. We therefore plot accuracy-threshold curves over the range [0.00 m, 0.04 m] in Fig. 4 and calculate the AUC (area under the accuracy-threshold curve) as the evaluation metric.
^1 https://rse-lab.cs.washington.edu/projects/posecnn/
^2 https://github.com/NVlabs/Deep_Object_Pose
GRIP outperforms the other two methods under most error thresholds, especially lower ones, thereby facilitating robotic manipulation tasks.

V-C.3 Comparing Accuracy with PoseCNN on 15 YCB Objects
Next, we perform an extensive comparison of our method with PoseCNN (with ICP) on 15 of the 21 YCB objects. Table II and Fig. 5 show our overall results and detailed accuracy evaluations for each object.
GRIP outperforms PoseCNN+ICP for most objects under all three settings. All methods perform worse under varied lighting and occlusions than in the basic setting. We can infer the strengths and weaknesses of each method from its performance variance across objects. For example, PoseCNN with ICP performs better on symmetric objects such as 003_cracker_box and 061_foam_brick than on others such as 021_bleach_cleanser; symmetric objects contain repetitive features that are more likely to be captured by learning-based systems. GRIP performs better on objects that are well recognizable by a depth camera. Large and compact objects such as 006_mustard_bottle and 024_bowl naturally generate dense and continuous 3D point cloud observations that effectively capture their geometry. Objects with thin or articulated parts, such as 037_scissors, 052_extra_large_clamp, and 025_mug, produce sparse point clouds around their handle-like parts that do not effectively reveal the scene geometry, especially object orientation. Hence, our GRIP algorithm best suits scenarios where rich depth sensory data are available, whether due to detectable object dimensions and surface materials or to high-definition depth sensors. Finally, distinguishing near-identical objects remains challenging. For instance, 051_large_clamp and 052_extra_large_clamp have identical colors and shapes and differ only slightly in size, which results in poor estimation accuracy for all methods.
VI. Conclusions
We have introduced GRIP, a two-stage method for robust 6D object pose estimation suited to adversarial settings. GRIP demonstrated comparable or improved performance with respect to state-of-the-art neural network pose estimators on the adversarial YCB dataset. The key insight of GRIP is to avoid hard thresholding, which introduces false positives and false negatives, until a final pose estimate is required; avoiding hard thresholds increases the possibility of finding the true pose in adversarial environments. In addition, a generative second stage inherently provides an avenue for explainable perception without requiring deciphering of network weights, and this generative process readily extends to tracking over time through the inclusion of a proper process model. The results presented are also amenable to improvement, given the limited types of features considered. These benefits come at the cost of assuming only one instance of each object is present in the scene. For future work, we aim to address these limitations by exploring features amenable to robust inference with multiple object instances in greater clutter.
References

[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [3] Marcus Gualtieri, Andreas ten Pas, and Robert Platt. Pick and place without geometric object models. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7433–7440. IEEE, 2018.
[4] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[5] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[6] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
[7] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. DenseFusion: 6D object pose estimation by iterative dense fusion. arXiv preprint arXiv:1901.04780, 2019.
[8] Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, and Dawn Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.

[9] Zhiqiang Sui, Zheming Zhou, Zhen Zeng, and Odest Chadwicke Jenkins. SUM: Sequential scene understanding and manipulation. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 3281–3288. IEEE, 2017.
[10] Yanqi Liu, Alessandro Costantini, R. Bahar, Zhiqiang Sui, Zhefan Ye, Shiyang Lu, and Odest Chadwicke Jenkins. Robust object estimation using generative-discriminative inference for secure robotics applications. In Proceedings of the International Conference on Computer-Aided Design, page 75. ACM, 2018.
[11] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.
[12] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in belief space. The International Journal of Robotics Research, 32(9–10):1194–1227, 2013.
[13] Richard E. Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3–4):189–208, 1971.
[14] Shiwali Mohan, Aaron H. Mininger, James R. Kirk, and John E. Laird. Acquiring grounded representations of words with situated interactive instruction. In Advances in Cognitive Systems. Citeseer, 2012.
[15] Venkatraman Narayanan and Maxim Likhachev. Discriminatively-guided deliberative perception for pose estimation of multiple 3D object instances. In Robotics: Science and Systems, 2016.
[16] Venkatraman Narayanan and Maxim Likhachev. PERCH: Perception via search for multi-object recognition and localization. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5052–5059. IEEE, 2016.
[17] Ziyuan Liu, Dong Chen, Kai M. Wurm, and Georg von Wichert. Table-top scene analysis using knowledge-supervised MCMC. Robotics and Computer-Integrated Manufacturing, 33:110–123, 2015.
[18] Dominik Joho, Gian Diego Tipaldi, Nikolas Engelhard, Cyrill Stachniss, and Wolfram Burgard. Nonparametric Bayesian models for unsupervised scene analysis and reconstruction. Robotics, page 161, 2013.
 [19] Alvaro Collet, Manuel Martinez, and Siddhartha S Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 30(10):1284–1306, 2011.
 [20] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 [21] Matei Ciocarlie, Kaijen Hsiao, Edward Gil Jones, Sachin Chitta, Radu Bogdan Rusu, and Ioan A Şucan. Towards reliable grasping and manipulation in household environments. In Experimental Robotics, pages 241–252. Springer, 2014.
[22] Chavdar Papazov, Sami Haddadin, Sven Parusel, Kai Krieger, and Darius Burschka. Rigid 3D geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research, 31(4):538–553, 2012.
[23] Andreas ten Pas and Robert Platt. Localizing handle-like grasp affordances in 3D point clouds. In Experimental Robotics, pages 623–638. Springer, 2016.
[24] Andreas ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose detection in point clouds. The International Journal of Robotics Research, 36(13–14):1455–1473, 2017.
[25] Zhiqiang Sui, Odest Chadwicke Jenkins, and Karthik Desingh. Axiomatic particle filtering for goal-directed robotic manipulation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4429–4436. IEEE, 2015.
[26] Karthik Desingh, Odest Chadwicke Jenkins, Lionel Reveret, and Zhiqiang Sui. Physically plausible scene estimation for manipulation in clutter. In 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pages 1073–1080. IEEE, 2016.
[27] Zhen Zeng, Zheming Zhou, Zhiqiang Sui, and Odest Chadwicke Jenkins. Semantic robot programming for goal-directed manipulation in cluttered scenes. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7462–7469. IEEE, 2018.
[28] Chaitanya Mitash, Abdeslam Boularias, and Kostas Bekris. Robust 6D object pose estimation with stochastic congruent sets. arXiv preprint arXiv:1805.06324, 2018.
[29] Nicolas Mellado, Dror Aiger, and Niloy J. Mitra. Super 4PCS: Fast global point-cloud registration via smart indexing. In Computer Graphics Forum, volume 33, pages 205–215. Wiley Online Library, 2014.
[30] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
[31] Stephen J. McKenna and Hammadi Nait-Charif. Tracking human motion using auxiliary particle filters and iterated likelihood weighting. Image and Vision Computing, 25:852–862, 2007.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[34] Ji Zhang and Sanjiv Singh. LOAM: Lidar odometry and mapping in real-time. In Robotics: Science and Systems, volume 2, page 9, 2014.
[35] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.