Robust and reliable operation of autonomous mobile manipulators remains an open challenge for robotics, and perception remains a critical bottleneck. Within the well-known sense-plan-act paradigm, truly autonomous robot manipulators need the ability to perceive the world, reason over the manipulation actions afforded by objects towards a given goal, and carry out these actions as physical motion. However, manipulation in unstructured and cluttered environments is particularly challenging. In particular, executing a task with specific grasp points demands first recognizing the object and then estimating its precise pose. Figure 1 illustrates such a task: the robot moves the objects from the table to the shelf based on their categories.
With the advent of convolutional neural networks (CNN), many challenging perception tasks, such as image classification and object detection, have improved significantly. However, relying solely on a CNN to detect objects and estimate their poses in a cluttered environment poses several issues. For instance, varying orientations and occlusions may greatly alter the appearance of objects, which may affect the performance of object detectors. Therefore, to avoid making hard decisions on the detection result, we present an alternative paradigm for object detection and pose estimation that utilizes the discriminative power of deep neural networks while maintaining versatility and robustness in ever-changing environments.
Our two-stage approach builds upon both a CNN and a generative sampling-based local search method to achieve sampling of the network density, which we call the SAND filter. The first stage of the SAND filter detects objects using a CNN and RGB images. However, unlike other popular object detectors, such as Faster R-CNN, we do not perform any filtering over the object bounding boxes, regardless of their confidence scores. Hence, the sampling method in the second stage can take full advantage of the probability density prior provided by the CNN. Generative sampling methods have been widely used in robot localization and object tracking due to their robustness. We instead employ such methods to perform local search over the hypothesis space in observed depth images, so as to refine the object detection results and estimate object poses. Each sample in our second stage represents a possible object pose. The weight of each sample is bootstrapped by the prior given by the CNN, and samples are re-sampled based on the agreement between the hypothesized state, produced by the rendering engine, and the observed state from the depth image. After several iterations, the samples converge to a final state that best represents the observed scene.
In this paper, we demonstrate that the second stage of the SAND filter can effectively improve the object detection result from the first stage and further produce accurate pose estimates. It is worth noting that the SAND filter is not limited to our own CNN implementation; it can be adapted to other CNN-based object detectors with minor modification. Likewise, the sampling-based local search in the second stage can also be replaced.
II Related Work
II-A Perception for Manipulation
PR2 interactive manipulation segmented non-touching objects from a flat surface by clustering surface normals. Collet et al. presented a discriminative approach, MOPED, to detect objects and estimate object poses using iterative clustering-estimation (ICE) with multiple cameras. Papazov et al. used a bottom-up approach of matching 3D object geometries using RANSAC and retrieval by hashing. Narayanan et al. integrate A* global search with heuristics from neural networks to perform scene estimation, assuming the identities of the objects are known. Gualtieri et al. performed the pick and place task in a deep reinforcement learning framework. Varley et al. developed a grasp pose generation system under partial views of objects using deep learning.
II-B Object Detection
Following the success of AlexNet, regions with convolutional neural network features, or R-CNN, introduced by Girshick et al., became the dominant method for object detection. It first utilizes low- and mid-level features to generate object proposals, and then uses a CNN to extract features within each proposal. Finally, a linear classifier, such as an SVM, is trained on those features for the classification task. He et al. and Girshick further optimized the feature extraction process in R-CNN. Unified approaches, which integrate object proposal and classification, including our work, are inspired by R-CNN.
Long et al. propose FCN for semantic segmentation by replacing the fully connected layers in a traditional CNN with convolutional layers. FCN take images of arbitrary size and provide per-pixel classification labels. However, FCN cannot separate neighboring objects of the same category to obtain instance-level labels; hence we cannot directly re-task FCN for object detection. Nonetheless, most unified approaches build on FCN to localize and classify objects using the same networks.
Sermanet et al. propose an integrated CNN framework for classification, localization and detection in a multiscale, sliding-window fashion. All three tasks are learned simultaneously using a single shared network. Ren et al. expand on earlier work by taking the same networks used for the classification task and repurposing them to generate object proposals. Redmon et al. use a different approach, dividing the image into regions using a single network and predicting bounding boxes and classification scores for each region.
Our approach is inspired by Faster R-CNN and by pose estimation with generative sampling methods. However, we propose a two-stage framework that further addresses the challenge of detecting objects and estimating poses in a cluttered environment by equipping the generative sampling methods with the full potential of deep neural networks.
III Problem Formulation
Given an RGB-D observation from the robot sensor, consisting of an RGB image and a depth image, our aim is to estimate a joint distribution involving three quantities: the six-DoF object pose in the world frame, which comprises 3D spatial location and 3D orientation; the object bounding box, with a confidence score, in 2D image space; and the object label with its corresponding 3D geometry model. Figure 3 illustrates the formulation as a factor graph for each object, which can be written as follows:
Equations 2 and 3 are derived using the chain rule of probability, and Equation 4 represents the factoring into object detection, pose estimation and the observation prior. Here, we assume that pose estimation is conditionally independent of the RGB observation and that object detection is conditionally independent of the depth observation.
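As a sketch of the derivation described above, one factorization consistent with the two stated independence assumptions is shown below. The symbols are assumptions of this sketch, not necessarily the notation of the original equations: $q$ denotes the six-DoF pose, $b$ the bounding box, $o$ the object label, and $z_r$, $z_d$ the RGB and depth observations. The line numbering here need not match the equation numbers cited in the text.

```latex
\begin{align}
p(q, b, o \mid z_r, z_d)
  &= p(q \mid b, o, z_r, z_d)\; p(b, o \mid z_r, z_d) \\
  &= p(q \mid b, o, z_r, z_d)\; p(b \mid o, z_r, z_d)\; p(o \mid z_r, z_d) \\
  &= \underbrace{p(q \mid b, o, z_d)}_{\text{pose estimation}}\;
     \underbrace{p(b \mid o, z_r)}_{\text{object detection}}\;
     \underbrace{p(o \mid z_r, z_d)}_{\text{remaining prior term}}
\end{align}
```

The first two lines apply the chain rule; the last line drops $z_r$ from the pose term and $z_d$ from the detection term, which is exactly what the two conditional independence assumptions permit.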
Ideally, we could use Markov chain Monte Carlo (MCMC) to estimate the distribution in Equation 1. However, the state space is so large that direct computation is intractable. End-to-end neural network methods can also be used to estimate the distribution. For instance, PoseCNN estimates object pose from RGB images alone within a single CNN framework. However, PoseCNN requires a significant amount of data and human annotation to train the CNN. Our paradigm, on the other hand, compensates for limited data by employing a generative sampling method in the second stage. SUM implements a simple combination of Equation 1 to enable sequential manipulation. However, SUM requires data association to track the location of objects over time, which may lead to prolonged mis-detections given a poor initial estimate. Furthermore, SUM may suffer from false detections, and hence poor pose estimates, because hard filtering is performed after the object proposal and detection stage.
IV Inference Method
We propose a two-stage paradigm that computes the object detection factor and the pose estimation factor in separate stages. Figure 2 gives an overview of our method. The robot first has to estimate the pose of each object given an RGB-D image of a cluttered environment. In the first stage, our CNN localizes objects and gives each bounding box a confidence score. Then, in the second stage, we perform generative sampling-based optimization to estimate the object pose given a depth image and the scored object bounding boxes. Once the pose of an object has been estimated, the robot picks up the object and places it on a shelf according to its category. The intent is that the heuristics from the first stage inform the generative sampling optimization in the second stage, while the generative sampling checks for false detections from the first stage.
Instead of directly computing Equation 1, which suffers from the curse of dimensionality, we aim to maximize the joint probability by finding a pose, which can be defined as
Thus, due to the nature of this maximization, our paradigm is limited to estimating one instance of an object class in the scene. However, we plan to extend our paradigm to accommodate multiple instances by estimating the joint probability in the future.
IV-A Object Detection
The goal of our first stage is to provide object bounding boxes with confidence scores given an object class. To achieve this, we exploit the discriminative power of CNN. Inspired by region proposal networks (RPN), our CNN serves as a proposal method for the second stage. However, instead of only classifying regions as object or non-object, our networks also produce object class labels.
We choose VGG-16 as our base network. VGG-16 has 16 weight layers coupled with ReLU layers and max pooling layers. To enable VGG-16 to perform object detection at each window location, we replace the fully connected layers with convolutional layers to construct fully convolutional networks (FCN). Consequently, our networks “convolve” the input RGB image in a sliding window fashion.
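To illustrate why replacing a fully connected layer with a convolutional layer yields sliding-window behaviour, the following minimal sketch uses plain Python lists rather than the actual network. The names `fc_score` and `conv_heatmap`, the single-channel input, and the toy weights are all assumptions made for illustration only.

```python
def fc_score(window, weights, bias):
    """Fully connected unit: dot product over a flattened k x k window."""
    flat = [v for row in window for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def conv_heatmap(image, weights, bias, k):
    """Reuse the same weights as a k x k convolution (stride 1, no padding).

    Each output cell is the fully connected score of the window whose
    top-left corner sits at that cell, so a larger input gives a heatmap.
    """
    rows, cols = len(image), len(image[0])
    heatmap = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            window = [image[r + i][c:c + k] for i in range(k)]
            out_row.append(fc_score(window, weights, bias))
        heatmap.append(out_row)
    return heatmap


# Toy single-channel image and hypothetical weights for a 2 x 2 window.
image = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 1.0],
    [2.0, 0.0, 1.0, 0.0],
]
weights = [0.5, -1.0, 1.0, 2.0]  # row-major order
bias = 0.1
heatmap = conv_heatmap(image, weights, bias, k=2)  # 2 rows x 3 columns
```

A classifier trained on fixed-size windows can therefore be applied to a larger image in a single pass, producing one score per window location.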
Beyond the classification output, we also predict the aspect ratio of the object bounding box. After the conv2 layer, our CNN extends another branch to predict the aspect ratio of the bounding boxes, which is class agnostic. The detailed architecture is illustrated in Figure 4. Here, we choose not to adopt bounding box regression or anchor methods. Because of occlusions and the varying poses of objects (Figure 1), it is challenging to estimate the exact 2D coordinates of objects with regression or to predict aspect ratios based on statistics. Hence, the shape branch of our CNN does not require class-specific features from later layers, such as conv5; it simply provides an aspect ratio prior and leaves the exact localization to the second stage.
The input to our networks is a pyramid of images at different scales, which enables the networks to capture objects of all sizes. The output is thus a pyramid of heatmaps together with a pyramid of shape maps. Each pixel in a heatmap, with the corresponding pixel in the shape map, represents a bounding box in the input image. For each bounding box, there is a categorical distribution over all object classes. Therefore, the bounding boxes received by the next stage form a list of 5-tuples, each representing the 2D coordinates of a box and its confidence score given an object class.
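The mapping from heatmap and shape-map pixels to 5-tuples can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the network stride (32), the window size (224), the convention that the aspect ratio is width divided by height, and the choice to preserve window area are all hypothetical. Consistent with the text, no score-based filtering is applied.

```python
def decode_boxes(heatmaps, shape_maps, scales, stride=32, window=224):
    """Turn per-scale maps into (x1, y1, x2, y2, score) tuples.

    heatmaps[s][r][c]   confidence for the queried class at pyramid level s
    shape_maps[s][r][c] predicted aspect ratio (assumed width / height)
    scales[s]           factor by which the input image was resized
    """
    boxes = []
    for heat, shape, scale in zip(heatmaps, shape_maps, scales):
        for r, row in enumerate(heat):
            for c, score in enumerate(row):
                # Centre of the sliding window, mapped back to original pixels.
                cx = (c * stride + window / 2.0) / scale
                cy = (r * stride + window / 2.0) / scale
                # Keep the window area and reshape it by the aspect ratio.
                side = window / scale
                ratio = shape[r][c]
                w = side * ratio ** 0.5
                h = side / ratio ** 0.5
                boxes.append((cx - w / 2, cy - h / 2,
                              cx + w / 2, cy + h / 2, score))
    return boxes


# One pyramid level with two window locations.
boxes = decode_boxes(
    heatmaps=[[[0.9, 0.1]]],
    shape_maps=[[[1.0, 4.0]]],
    scales=[1.0],
)
```

Boxes may extend past the image border in this sketch; clipping is left out because the second stage crops the depth image itself.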
IV-B Pose Estimation
The purpose of the second stage is to estimate the object pose and to further refine the bounding box based on the estimated pose by performing generative sampling-based local search. Our local search is inspired by sampling methods such as the bootstrap filter, which offer robustness and versatility over the search space. This is critical in our context, since the result of the first stage may be imperfect and the manipulation task depends on the accuracy of the pose. Hence, we expect the second stage to improve object localization, or even correct false detections, based on the result of the first stage.
We use a collection of weighted samples to represent multiple hypotheses about the object pose. Given an object class, we have a corresponding object geometry model and can therefore render a point cloud using the z-buffer of a 3D graphics engine, given an object pose and the camera view matrix. Essentially, these rendered point clouds are our collection of samples.
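Turning a depth buffer into a point cloud can be sketched with a standard pinhole back-projection. The function name `back_project` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration, and the sketch assumes the depth values have already been converted to metric distance along the optical axis (a raw graphics z-buffer would first need to be linearized).

```python
def back_project(depth, fx, fy, cx, cy):
    """Convert a metric depth image into an organized point cloud.

    depth[v][u] is the distance along the optical axis in metres; values
    <= 0 mean "no measurement". The result has the same grid shape and
    holds an (x, y, z) tuple, or None, per pixel.
    """
    cloud = []
    for v, row in enumerate(depth):
        cloud_row = []
        for u, z in enumerate(row):
            if z <= 0.0:
                cloud_row.append(None)
            else:
                cloud_row.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        cloud.append(cloud_row)
    return cloud


# Tiny 2 x 2 depth image with hypothetical intrinsics.
depth = [[1.0, 0.0],
         [2.0, 2.0]]
cloud = back_project(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Keeping the cloud organized (one entry per pixel) makes it straightforward to compare a rendered and an observed cloud pixel by pixel.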
The initial samples are determined by the output of the first stage. Recall from Section IV-A that our CNN produces a density pyramid, which is a list of bounding boxes with confidence scores. We perform importance sampling over the confidence scores to initialize our samples, so a bounding box with a higher confidence score receives more samples to search over.
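A minimal sketch of this initialization draws bounding boxes with probability proportional to their confidence scores. The function name `init_samples`, the toy boxes, and the sample count are hypothetical; how an initial six-DoF pose is chosen inside each drawn box is not covered here.

```python
import random


def init_samples(boxes, n_samples, rng):
    """Draw n_samples boxes with probability proportional to confidence.

    boxes: list of (x1, y1, x2, y2, score) tuples from the first stage.
    """
    scores = [box[4] for box in boxes]
    return rng.choices(boxes, weights=scores, k=n_samples)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
boxes = [
    (0, 0, 50, 50, 0.90),
    (10, 10, 60, 60, 0.09),
    (200, 40, 260, 90, 0.01),
]
samples = init_samples(boxes, 1000, rng)
counts = {box: samples.count(box) for box in boxes}
```

Low-scoring boxes still receive a few samples, which is what lets the second stage recover from an imperfect first stage.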
To evaluate a sample state, we first crop the depth image with the corresponding bounding box and then back-project it into a point cloud. Note that the observed point cloud can differ between samples, as they are associated with different bounding boxes. We measure the “similarity” between the rendered point cloud and the observed point cloud by counting how many of their points match. First, we define the inlier function as follows:
Here, the inlier function compares a point in the observed point cloud with a point in the rendered point cloud. If the Euclidean distance between an observed point and a rendered point is within a certain sensor resolution range, the total number of inliers increases by 1. The number of inliers is then defined as
where the sum runs over the 2D indices of the observed point cloud. Next, the weight of each hypothesis is defined as
where the normalizers are the number of points in the observed point cloud and the number of points in the rendered point cloud, and the remaining symbols are coefficients. The first term in Equation 9 weighs how well the rendered point cloud matches the observed point cloud within the bounding box. However, since the bounding boxes from Section IV-A are not perfect and usually truncate the objects, the second term accommodates this by weighing how well the current hypothesis explains itself not only within the bounding box but in the scene. We further blend in the object confidence score from the previous stage to balance the two stages.
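The inlier count and the three-term weight can be sketched as below. This is one reading of the description, not the paper's exact equations: the pixel-wise comparison of organized clouds, the linear combination of the terms, and the names and values of `eps`, `alpha`, `beta` and `gamma` are all assumptions.

```python
import math


def count_inliers(observed, rendered, eps):
    """Count pixels whose observed and rendered points lie within eps metres."""
    inliers = 0
    for obs_row, ren_row in zip(observed, rendered):
        for p, r in zip(obs_row, ren_row):
            if p is not None and r is not None and math.dist(p, r) < eps:
                inliers += 1
    return inliers


def count_points(cloud):
    """Number of valid (non-None) points in an organized cloud."""
    return sum(1 for row in cloud for p in row if p is not None)


def crop(cloud, box):
    """Crop an organized cloud to integer pixel bounds (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in cloud[y1:y2]]


def sample_weight(observed, rendered, box, confidence,
                  eps=0.005, alpha=0.4, beta=0.4, gamma=0.2):
    """Blend box-level agreement, scene-level agreement and CNN confidence."""
    obs_crop, ren_crop = crop(observed, box), crop(rendered, box)
    n_obs = count_points(obs_crop)   # observed points inside the box
    n_ren = count_points(rendered)   # all rendered points of the hypothesis
    in_box = count_inliers(obs_crop, ren_crop, eps) / n_obs if n_obs else 0.0
    in_scene = count_inliers(observed, rendered, eps) / n_ren if n_ren else 0.0
    return alpha * in_box + beta * in_scene + gamma * confidence


# 2 x 3 organized clouds; None marks pixels without a point.
observed = [[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), None],
            [(0.0, 0.1, 1.0), (0.1, 0.1, 1.0), (0.2, 0.1, 1.0)]]
rendered = [[None, (0.1, 0.0, 1.001), (0.2, 0.0, 1.0)],
            [None, (0.1, 0.1, 1.0), (0.2, 0.1, 1.2)]]
weight = sample_weight(observed, rendered, box=(0, 0, 2, 2), confidence=0.9)
```

In this toy case two of the four observed points in the box and two of the four rendered points are matched, so both agreement terms equal 0.5.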
To obtain the optimal pose, we follow the procedure of importance sampling to assign a new weight to each sample. During the re-sampling process, each pose is also perturbed by noise drawn from a normal distribution in the six-DoF space. Once the average weight exceeds a threshold, we consider the local search to have converged, and the sample with the maximum weight is taken as the best approximation of the true object pose.
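The re-sampling loop described above can be sketched as follows. The pose parameterization as a 6-tuple, the per-dimension standard deviations `sigma`, the `max_iters` cap and the toy weight function are assumptions for illustration; in the actual system the weight would come from comparing rendered and observed point clouds.

```python
import math
import random


def resample_and_perturb(poses, weights, sigma, rng):
    """Importance-resample the poses, then diffuse each with Gaussian noise.

    poses: list of 6-tuples (x, y, z, roll, pitch, yaw).
    sigma: six standard deviations, one per pose dimension.
    Weights must not all be zero, or random.choices raises ValueError.
    """
    drawn = rng.choices(poses, weights=weights, k=len(poses))
    return [tuple(rng.gauss(v, s) for v, s in zip(pose, sigma))
            for pose in drawn]


def local_search(poses, weight_fn, sigma, threshold, max_iters, rng):
    """Iterate until the average weight exceeds threshold or max_iters is hit."""
    weights = [weight_fn(p) for p in poses]
    for _ in range(max_iters):
        if sum(weights) / len(weights) > threshold:
            break
        poses = resample_and_perturb(poses, weights, sigma, rng)
        weights = [weight_fn(p) for p in poses]
    best = max(range(len(poses)), key=lambda i: weights[i])
    return poses[best], weights[best]


# Toy problem: the weight peaks at a hypothetical true pose.
TRUE_POSE = (0.5, 0.2, 0.8, 0.0, 0.0, 1.0)


def toy_weight(pose):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(pose, TRUE_POSE)) / 0.05)


rng = random.Random(1)
initial = [tuple(rng.uniform(-1.0, 1.0) for _ in range(6)) for _ in range(200)]
pose, weight = local_search(initial, toy_weight, sigma=(0.02,) * 6,
                            threshold=0.8, max_iters=50, rng=rng)
```

The perturbation keeps the search from collapsing onto the first well-weighted sample, at the cost of needing a cap on the number of iterations when the threshold is never reached.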
We use PyTorch (http://pytorch.org/) for our CNN implementation. The aspect ratio branch of our CNN predicts a discrete set of aspect ratios. Each training image contains only one object and has a fixed size. The aspect ratio of an object in a training image can be inferred from the width and height of the object. We ignore the aspect ratio labels of background images during training. The activation function for both branches of our CNN is softmax, since we treat aspect ratio prediction as a classification task as well. Thus, the loss function in the training phase is cross entropy.
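The two-branch loss can be illustrated in plain Python as below. The sketch assumes the class loss and the aspect ratio loss are simply added, and uses a hypothetical `IGNORE` marker for background images whose aspect ratio label is skipped; neither detail is confirmed by the text.

```python
import math

IGNORE = -1  # hypothetical marker: no aspect ratio label (background image)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true label under the softmax."""
    return -math.log(softmax(logits)[label])


def two_branch_loss(class_logits, class_label, ratio_logits, ratio_label):
    """Class loss plus aspect ratio loss, skipping ignored ratio labels."""
    loss = cross_entropy(class_logits, class_label)
    if ratio_label != IGNORE:
        loss += cross_entropy(ratio_logits, ratio_label)
    return loss


# Uniform logits: 2 classes and 4 aspect ratio bins.
object_loss = two_branch_loss([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 1)
background_loss = two_branch_loss([0.0, 0.0], 1, [0.0, 0.0, 0.0, 0.0], IGNORE)
```

With uniform predictions the object image incurs log 2 + log 4 while the background image incurs only log 2, since its shape branch contributes nothing.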
Our second-stage local search relies on the OpenGL graphics engine to render depth images, given a 3D geometry model and a camera view matrix. During the local search, we allocate a fixed number of samples for each iteration and run a fixed number of iterations in total. After the final iteration, we select the sample with the highest weight as the estimated object pose.
Following pose estimation, we generate grasp poses for the object based on a method from prior work. To perform the categorical sorting task illustrated in Figure 2, we use a Fetch robot for object grasping and placement. The Fetch robot is a mobile manipulator with a seven-DoF arm and a pan-tilt head equipped with an RGB-D sensor.
Our dataset contains images of multiple object classes, collected in a cluttered tabletop environment as shown in Figure 5. All objects are labelled with 2D bounding boxes and six-DoF poses. The images are split into one set for training and parameter tuning and another set for testing.
We first present the result and analysis on pose estimation accuracy, followed by the robot categorical sorting tasks.
VI-B1 Pose Estimation
In the estimation experiments, we evaluated the performance of the SAND filter on 28 test images in the dataset. We employ an evaluation metric from prior work that measures the mean pairwise point distance between the ground truth point cloud and the point cloud rendered from the estimated 6D pose. This metric can also account for objects with an axis of symmetry, which is useful when comparing household objects.
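A common pair of metrics of this kind, often called ADD and ADD-S in the pose estimation literature, can be sketched as follows. Whether the evaluation here uses exactly these definitions is an assumption; the function names and toy numbers are illustrative only.

```python
import math


def mean_pairwise_distance(gt_points, est_points):
    """Mean distance between corresponding points (same index, same vertex)."""
    total = sum(math.dist(g, e) for g, e in zip(gt_points, est_points))
    return total / len(gt_points)


def mean_closest_distance(gt_points, est_points):
    """Symmetry-tolerant variant: match each estimated point to its nearest
    ground truth point instead of its indexed counterpart."""
    total = sum(min(math.dist(g, e) for g in gt_points) for e in est_points)
    return total / len(est_points)


def accuracy(errors, threshold):
    """Fraction of test cases whose error is below the distance threshold."""
    return sum(1 for err in errors if err < threshold) / len(errors)


# A two-point "object" whose estimate is flipped about its symmetry axis.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
est = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
strict_error = mean_pairwise_distance(gt, est)     # penalizes the flip
symmetric_error = mean_closest_distance(gt, est)   # ignores the flip

# Hypothetical per-object errors in metres, scored at a 0.02 m threshold.
acc = accuracy([0.01, 0.03, 0.015, 0.2], threshold=0.02)
```

Sweeping the threshold and plotting the resulting accuracy gives curves of the kind reported in the figures.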
We first compare the SAND filter with a baseline that uses the state-of-the-art object detector Faster R-CNN as the first stage and the widely used pose estimation algorithm ICP as the second stage. As shown in Figure 6, the SAND filter is consistently better than the baseline across all thresholds. For a tight distance threshold (e.g. 0.02 meter), our approach achieves over sixty percent accuracy, which greatly outperforms the baseline. We attribute this to the ability of our second stage to iteratively narrow down the search space through sampling-based local search. Note that for the baseline, the initial orientation given to ICP for each object in the scene is set to zero, which biases the result towards specific object configurations (e.g. objects standing on the table).
We then replace the generative sampling-based local search in the second stage of the SAND filter with ICP and with Fast Point Feature Histograms (FPFH), while keeping the first stage the same. The plots in Figure 7 show that our second-stage method is significantly better than ICP and FPFH when the pyramid of heatmap detections is used as the first stage.
VI-B2 Categorical Sorting
In the manipulation experiments, the task for the robot is to sort all the objects on a tabletop into different locations on the shelf according to their categories (left column of Figure 2). The robot first runs the two-stage method to acquire the objects’ poses, then grasps the objects and places them at the corresponding locations on the shelf one at a time. We divide the 15 objects into three categories: food & drink, laundry and miscellaneous. Pose estimation is performed again after each object is placed. This process is illustrated in Figure 8.
The robot performs multiple sorting sequences in total. Because task completion depends on many factors, such as motion planning and grasping, we split the task into three sub-tasks: pose estimation, grasping, and placement. This lets us analyze the success rate of each sub-task and determine the bottleneck of the entire system.
Table I shows the success rate for each sub-task in our categorical sorting tasks. Each task involves multiple objects. For pose estimation, if all the detections fall below a certain threshold, the estimate is considered a failure. After successful pose estimation, we inspect object grasping given the matched poses. For placement, as long as the robot places the object in its corresponding area on the shelf, we consider it a success. Since these three sub-tasks are sequential, we consider the entire task complete if the placement is successful.
| | Pose Estimation | Grasping | Task Completion |
| Success Rate | 40/42 (0.952) | 37/40 (0.925) | 37/42 (0.881) |
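The entries of Table I can be checked with a few lines of arithmetic. The sketch below only recomputes the reported ratios and the failure counts per sub-task from the fractions in the table; the dictionary keys are names chosen for illustration.

```python
# (successes, attempts) per column of Table I.
results = {
    "pose_estimation": (40, 42),
    "grasping": (37, 40),
    "task_completion": (37, 42),
}

# Success rates rounded to three decimals, as in the table.
rates = {name: round(s / n, 3) for name, (s, n) in results.items()}

# Failures attributed to each sequential sub-task.
pose_failures = results["pose_estimation"][1] - results["pose_estimation"][0]
grasp_failures = results["grasping"][1] - results["grasping"][0]
total_failures = results["task_completion"][1] - results["task_completion"][0]
```

Because the sub-tasks are sequential, the 40 grasp attempts are exactly the 40 successful pose estimates, and the pose and grasp failures together account for every incomplete task.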
According to Table I, the main source of failure is grasping, since three of the five failed cases are due to grasping. The end-effector of the Fetch robot is a hard gripper without any tactile sensors. Therefore, the robot essentially performs open-loop grasping on the cluttered tabletop without any feedback. In the failed pose estimation cases, our detector fails to locate the object in the scene, which leads to a negative pose match. However, considering the challenging nature of our scenes and tasks, the total success rate is promising.
VII Conclusion and Discussion
In this work, we present a novel two-stage paradigm, the SAND filter, that leverages both CNN and generative sampling-based local search to achieve accurate object detection and six-DoF pose estimation. We further build a manipulation pipeline to perform categorical object sorting tasks.
Object manipulation requires accurate object detection and pose estimation. Our SAND filter enables the robot to perceive the environment despite occlusion and clutter. We hope our two-stage paradigm sheds light on the challenging nature of perception for manipulation tasks and furthers progress towards truly autonomous manipulation.
In the future, we will enable the SAND filter to detect and estimate poses for multiple object instances. In addition, the current SAND filter works only on a single observation; it can be extended to account for sequential observations.
- Besl and McKay  Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
- Ciocarlie et al.  Matei Ciocarlie, Kaijen Hsiao, Edward Gil Jones, Sachin Chitta, Radu Bogdan Rusu, and Ioan A Şucan. Towards reliable grasping and manipulation in household environments. In Experimental Robotics, pages 241–252. Springer Berlin Heidelberg, 2014.
- Collet et al.  Alvaro Collet, Manuel Martinez, and Siddhartha S Srinivasa. The moped framework: Object recognition and pose estimation for manipulation. The International Journal of Robotics Research, page 0278364911401765, 2011.
- Dellaert et al.  Frank Dellaert, Dieter Fox, Wolfram Burgard, and Sebastian Thrun. Monte carlo localization for mobile robots. In Robotics and Automation, 1999. Proceedings. 1999 IEEE International Conference on, volume 2, pages 1322–1328. IEEE, 1999.
- Girshick  Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
- Girshick et al.  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
- Gordon et al.  Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET, 1993.
- Gualtieri et al.  Marcus Gualtieri, Andreas ten Pas, and Robert Platt Jr. Category level pick and place using deep reinforcement learning. CoRR, abs/1707.05615, 2017. URL http://arxiv.org/abs/1707.05615.
- Hastings  W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970. ISSN 00063444.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision–ECCV 2014, pages 346–361. Springer, 2014.
- Hinterstoisser et al.  S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. 2012.
- Huang et al.  Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
- Isard and Blake  Michael Isard and Andrew Blake. Contour tracking by stochastic propagation of conditional density. In European conference on computer vision, pages 343–356. Springer, 1996.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Long et al.  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. arXiv preprint arXiv:1411.4038, 2014.
- Narayanan and Likhachev  Venkatraman Narayanan and Maxim Likhachev. Discriminatively-guided deliberative perception for pose estimation of multiple 3d object instances. In Proceedings of Robotics: Science and Systems, Ann Arbor, Michigan, June 2016. doi: 10.15607/RSS.2016.XII.023.
- Papazov et al.  Chavdar Papazov, Sami Haddadin, Sven Parusel, Kai Krieger, and Darius Burschka. Rigid 3d geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research, page 0278364911436019, 2012.
- Redmon et al.  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- Rusu  Radu Bogdan Rusu. Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments. PhD thesis, Computer Science department, Technische Universitaet Muenchen, Germany, October 2009.
- Sermanet et al.  Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Sui et al.  Zhiqiang Sui, Odest Chadwicke Jenkins, and Karthik Desingh. Axiomatic particle filtering for goal-directed robotic manipulation. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 4429–4436. IEEE, 2015.
- Sui et al. [2017a] Zhiqiang Sui, Lingzhu Xiang, Odest C Jenkins, and Karthik Desingh. Goal-directed robot manipulation through axiomatic scene estimation. The International Journal of Robotics Research, 36(1):86–104, 2017a.
- Sui et al. [2017b] Zhiqiang Sui, Zheming Zhou, Zhen Zeng, and Odest Chadwicke Jenkins. Sum: Sequential scene understanding and manipulation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3281–3288, Sept 2017b. doi: 10.1109/IROS.2017.8206164.
- Ten Pas and Platt  Andreas Ten Pas and Robert Platt. Localizing handle-like grasp affordances in 3d point clouds. In Experimental Robotics, pages 623–638. Springer, 2016.
- Uijlings et al.  Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
- Vapnik et al.  V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. 1997.
- Varley et al.  J. Varley, J. Weisz, J. Weiss, and P. Allen. Generating multi-fingered robotic grasps via deep learning. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4415–4420, Sept 2015. doi: 10.1109/IROS.2015.7354004.
- Xiang et al.  Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- Zitnick and Dollár  C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pages 391–405. Springer, 2014.