Never Mind the Bounding Boxes, Here's the SAND Filters

08/15/2018 ∙ by Zhiqiang Sui, et al. ∙ University of Michigan

Perception is the main bottleneck in performing autonomous mobile manipulation tasks, especially in cluttered and unstructured environments. In this paper, we propose a novel two-stage paradigm that leverages both a CNN object prior and generative sampling to perform object detection and 6D pose estimation. Our two-stage approach builds upon both a CNN and a generative sampling-based local search method to achieve sampling of the network density, or the SAND filter. We show quantitatively that SAND effectively improves object detection by reducing false positive and false negative recognitions, and further produces accurate pose estimates. We also conduct extensive categorical object sorting experiments to show that our method produces accurate and reliable detections and object poses.




I Introduction

Robust and reliable operation of autonomous mobile manipulators remains an open challenge for robotics, and perception is a critical bottleneck. Within the well-known sense-plan-act paradigm, truly autonomous robot manipulators need the ability to perceive the world, reason over manipulation actions afforded by objects towards a given goal, and carry out these actions in terms of physical motion. However, performing manipulation in unstructured and cluttered environments is particularly challenging due to many factors. In particular, executing a task with specific grasp points demands first recognizing the object and estimating its precise pose. Figure 1 illustrates such a task: the robot moves the objects from the table to the shelf based on their categories.

With the advent of convolutional neural networks (CNNs), many challenging perception tasks, such as image classification [14] and object detection [6], have improved significantly. However, relying solely on a CNN to detect objects and estimate their poses in a cluttered environment poses several issues. For instance, varying orientations and occlusions may greatly alter the appearance of objects, which may affect the performance of object detectors. Therefore, to avoid making hard decisions on the detection result, we present an alternative paradigm for object detection and pose estimation that not only utilizes the discriminative power of deep neural networks but also maintains versatility and robustness in ever-changing environments.

Fig. 1: A robot perceiving and sorting objects from a cluttered tabletop. The goal is to move objects from the table to the shelf based on their categories. Our SAND filter enables the robot to detect objects and estimate their poses in a two-stage process.

Our two-stage approach builds upon both a CNN and a generative sampling-based local search method to achieve sampling of the network density, or the SAND filter. The first stage of the SAND filter attempts to detect objects using a CNN and RGB images. However, unlike other popular object detectors, such as Faster R-CNN [19], we do not perform any filtering over the object bounding boxes, regardless of their object confidence scores. Hence, the sampling method in the second stage can take full advantage of the probability density prior provided by the CNN. Generative sampling methods have been widely used in robot localization [4] and object tracking [13] due to their robustness. We instead employ such methods to perform local search over the hypothesis space in observed depth images, so as to refine the object detection results as well as estimate object poses. Each sample in our second stage represents a possible object pose. The weight of each sample is bootstrapped by the prior given by the CNN, and the samples are re-sampled based on the agreement between their hypothesized states, produced by the rendering engine, and the observed state from the depth image. After several iterations, the samples essentially converge to a final state that best represents the observed scene.

Fig. 2: Overview of our approach. The robot operates in a cluttered environment and captures an RGB-D image. Stage 1 takes in an RGB image and generates object bounding boxes with confidence scores. Stage 2 crops the corresponding depth image with the bounding boxes and performs bootstrap filtering to estimate the poses of all objects in the scene. Eventually, the robot performs object manipulation based on the estimated poses.

In this paper, we demonstrate that the second stage of the SAND filter can effectively improve the object detection result from the first stage and further produce accurate pose estimates. It is worth noting that the SAND filter is not limited to our own CNN implementation; it can be adapted to other CNN-based object detectors with minor modification. Likewise, the sampling-based local search in the second stage can also be replaced.

II Related Work

II-A Perception for Manipulation

PR2 interactive manipulation [2] segmented non-touching objects from a flat surface by clustering surface normals. Collet et al. presented a discriminative approach, MOPED, to detect objects and estimate object poses using iterative clustering-estimation (ICE) with multiple cameras [3]. Papazov et al. [17] used a bottom-up approach of matching 3D object geometries using RANSAC and retrieval by hashing. Narayanan et al. [16] integrate A* global search with heuristics from neural networks to perform scene estimation, assuming known identification of objects.

For manipulation in cluttered environments, Ten Pas et al. [26] have shown success in detecting handle-like parts of objects to obtain grasp poses, and [8] performed the pick-and-place task in a deep reinforcement learning framework. Varley et al. developed a grasp pose generation system that works under partial views of objects using deep learning.

II-B Object Detection

Following the success of AlexNet [14], regions with convolutional neural network features, or R-CNN [6], introduced by Girshick et al., became the dominant method for object detection. It first utilizes low- and mid-level features to generate object proposals [27][32], and then uses a CNN to extract features within each proposal. Finally, a linear classifier, such as an SVM [28], is trained on those features for the classification task. [10] and [5] further optimized the feature extraction process in R-CNN. Unified approaches, which integrate object proposal and classification, including our work, are inspired by R-CNN.

Long et al. propose FCNs for semantic segmentation by replacing the fully connected layers in traditional CNNs with convolutional layers [15]. FCNs take images of arbitrary size and provide per-pixel classification labels. However, FCNs are not able to separate neighboring objects within the same category to obtain instance-level labels; hence, we cannot directly re-task FCNs for object detection. Nonetheless, most unified approaches are based on FCNs, localizing and classifying objects using the same networks.

Recently, there has been a trend to utilize FCN to perform both object localization and classification [21][18][19][12].

Sermanet et al. propose an integrated CNN framework for classification, localization, and detection in a multiscale, sliding window fashion [21]. All three tasks are learned simultaneously using a single shared network. Ren et al. expand the approach of [5] by taking the same networks used for classification and repurposing them to generate object proposals. Redmon et al. take a different approach, dividing the image into regions using a single network and predicting bounding boxes and classification scores for each region [18].

Our approach is inspired by Faster R-CNN [19] and by pose estimation with generative sampling methods [23], [24], [25]. However, our approach proposes a two-stage framework to further address the challenge of detecting objects and estimating poses in cluttered environments by giving the generative sampling methods access to the full potential of deep neural networks.

Fig. 3: Factor graph representation of our two-stage approach.

III Problem Formulation

Given an RGB-D observation (an RGB image and a depth image) from the robot sensor, our aim is to estimate the joint distribution over three variables for each object: the six DoF object pose in the world frame, which comprises a 3D spatial location and a 3D orientation; the object bounding box with a confidence score in the 2D image space; and the object label with its corresponding 3D geometry model. Figure 3 illustrates the formulation as a factor graph for each object.


Equations 2 and 3 are derived using the chain rule of probability, and Equation 4 represents the factoring into object detection, pose estimation, and the observation prior. Here, we assume that pose estimation is conditionally independent of the RGB observation and that object detection is conditionally independent of the depth observation.
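One way to write the factorization described above is sketched below. The symbols are assumptions made for illustration ($q$ for the six DoF pose, $b$ for the bounding box with its confidence score, $o$ for the object label, and $z_r$, $z_d$ for the RGB and depth observations) and may differ from the notation of the numbered equations.

```latex
\begin{align*}
P(q, b, o, z_r, z_d)
  &= P(q \mid b, o, z_r, z_d)\, P(b, o, z_r, z_d) \\
  &= P(q \mid b, o, z_r, z_d)\, P(b, o \mid z_r, z_d)\, P(z_r, z_d) \\
  &= \underbrace{P(q \mid b, o, z_d)}_{\text{pose estimation}}\;
     \underbrace{P(b, o \mid z_r)}_{\text{object detection}}\;
     \underbrace{P(z_r, z_d)}_{\text{observation prior}}
\end{align*}
```

The first two lines apply the chain rule. The last line drops $z_r$ from the pose factor and $z_d$ from the detection factor, which is exactly the pair of conditional independence assumptions stated above.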

Ideally, we could use Markov chain Monte Carlo (MCMC) [9] to estimate the distribution in Equation 1. However, the state space is so large that computing it directly is intractable. An end-to-end neural network can also be used to estimate the distribution. For instance, PoseCNN attempts to estimate object pose from RGB images alone within a single CNN framework [30]. However, PoseCNN requires a significant amount of data and human annotation to train the CNN. Our paradigm, on the other hand, compensates for the data deficiency by employing a generative sampling method in the second stage. SUM [25] implements a simple combination of Equation 1 to enable sequential manipulation. However, in [25], data association is required to track the location of objects over time, which may lead to prolonged misdetections given a poor initial estimate. Furthermore, SUM may suffer from unavoidable false detections, and hence poor pose estimates, because hard filtering is performed after the object proposal and detection stage.

IV Inference Method

We propose a two-stage paradigm that computes the object detection factor and the pose estimation factor in separate stages. Figure 2 gives an overview of our method. The robot first has to estimate the pose of each object given an RGB-D image of a cluttered environment. In the first stage, our CNN localizes objects and gives each bounding box a confidence score. Then, in the second stage, we perform generative sampling-based optimization to estimate the object pose given a depth image and the scored object bounding boxes. Once the pose of an object has been estimated, the robot picks up the object and places it on the shelf according to its category. The intent is that the heuristics from the first stage better inform the generative sampling optimization in the second stage, while the generative sampling helps catch false detections from the first stage.

Because of the curse of dimensionality, instead of directly computing Equation 1, we aim to find the pose that maximizes the joint probability. Since this maximization returns a single best estimate, our paradigm is limited to estimating one instance of an object class in the scene. However, we will extend our paradigm to accommodate multiple instances by estimating the joint probability in the future.

Fig. 4: Network architecture. Our CNN is based on the VGG-16 network architecture and has two branches, a classifier branch and an aspect ratio branch, to detect objects. The blue layers are convolutional layers and the green layers are max pooling layers.

IV-A Object Detection

The goal of our first stage is to provide object bounding boxes with confidence scores given an object class. To achieve this, we exploit the discriminative power of CNNs. Inspired by region proposal networks (RPN) in [19], our CNN serves as a proposal method for the second stage. However, instead of only classifying regions as object or non-object, our network produces object class labels.

We choose the VGG-16 network [22] as our base network. VGG-16 has 16 weight layers coupled with ReLU layers and max pooling layers. To enable VGG-16 to perform object detection at each window location, we replace its fully connected layers with convolutional layers to construct a fully convolutional network (FCN), as in [15]. Consequently, our network “convolves” the input RGB image in a sliding window fashion.
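To illustrate why replacing fully connected layers with convolutional layers yields sliding-window behaviour, the sketch below applies one dense layer's weights at every window position of a larger feature map. The shapes, the window size `k`, and the function name `dense_as_conv` are hypothetical choices for illustration, not details of our network.

```python
import numpy as np

def dense_as_conv(feature_map, weights, k):
    """Apply a dense layer, defined on k x k x C windows, at every window position.

    feature_map: (H, W, C) array; weights: (num_out, k * k * C) array.
    Returns an (H - k + 1, W - k + 1, num_out) array of per-window outputs.
    """
    h, w, c = feature_map.shape
    num_out = weights.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, num_out))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            window = feature_map[i:i + k, j:j + k, :].reshape(-1)
            out[i, j] = weights @ window  # same computation as the dense layer
    return out

rng = np.random.default_rng(0)
fmap = rng.normal(size=(9, 12, 4))      # larger than the 7 x 7 window
w_fc = rng.normal(size=(3, 7 * 7 * 4))  # dense layer with 3 outputs
scores = dense_as_conv(fmap, w_fc, k=7)
print(scores.shape)  # (3, 6, 3)
```

Each spatial position of `scores` is what the dense layer would have produced on the corresponding window, so a single forward pass scores every window location.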

Beyond the classification output, we would also like to predict the aspect ratio of the object bounding box. After the conv2 layer, our CNN extends another branch to predict the aspect ratio of the bounding boxes, which is class agnostic. The detailed architecture is illustrated in Figure 4. Here, we choose not to adopt bounding box regression or anchor methods: because of occlusions and the varying poses of objects (Figure 1), it is challenging to estimate the exact 2D coordinates of objects with regression or to predict aspect ratios based on statistics. Hence, the shape branch of our CNN does not require class-specific features from later layers [31], such as conv5; it simply provides an aspect ratio prior and leaves exact localization to the second stage.

The input to our network is a pyramid of images at different scales, which enables the network to capture objects of all sizes. The output is therefore a pyramid of heatmaps together with a pyramid of shape maps. Each pixel in a heatmap, with the corresponding pixel in the shape map, represents a bounding box in the input image. For each bounding box, there is a categorical distribution over all object classes. Therefore, the bounding boxes received by the next stage form a list of 5-tuples, each representing the 2D box coordinates and a confidence score for a given object class.
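The mapping from heatmap and shape map pixels to 5-tuples can be sketched as follows. The stride, window size, discrete list of aspect ratios, and the area-preserving way of applying a ratio are all assumptions made for illustration; `boxes_from_pyramid` is a hypothetical helper, not our implementation.

```python
import numpy as np

def boxes_from_pyramid(heatmaps, shape_maps, scales, ratios, stride=32, window=224):
    """Turn per-scale heatmaps and shape maps into (x1, y1, x2, y2, score) tuples.

    heatmaps[s]:   (H_s, W_s) confidence map for one object class at scale s.
    shape_maps[s]: (H_s, W_s) integer map indexing into `ratios` (width / height).
    No score thresholding is applied: every pixel yields a box.
    """
    boxes = []
    for heat, shape, scale in zip(heatmaps, shape_maps, scales):
        for (row, col), score in np.ndenumerate(heat):
            ratio = ratios[shape[row, col]]
            # Window centre expressed in the original (unscaled) image.
            cx = (col * stride + window / 2) / scale
            cy = (row * stride + window / 2) / scale
            # Keep the window area and reshape it to the predicted aspect ratio.
            side = window / scale
            w = side * np.sqrt(ratio)
            h = side / np.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, float(score)))
    return boxes

heatmaps = [np.array([[0.1, 0.9]]), np.array([[0.5]])]
shape_maps = [np.array([[0, 1]]), np.array([[2]])]
boxes = boxes_from_pyramid(heatmaps, shape_maps, scales=[1.0, 0.5], ratios=[0.5, 1.0, 2.0])
print(len(boxes))  # 3
```

Keeping low-scoring boxes in the list is deliberate here, since the second stage is meant to consume the full density rather than a filtered set.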

IV-B Pose Estimation

The purpose of the second stage is to estimate the object pose and to further refine the bounding box based on the estimated pose by performing generative sampling-based local search. Our local search is inspired by sampling methods such as the bootstrap filter [7], which offer robustness and versatility over the search space. This is critical in our context, since the result of the first stage may be imperfect and the manipulation task depends on the accuracy of the pose. Hence, we expect the second stage to improve object localization, or even correct false detections, based on the result of the first stage.

We use a collection of weighted samples to represent multiple hypotheses over the state of the object pose. Given an object class, we have a corresponding object geometry model and can therefore render a point cloud using the z-buffer of a 3D graphics engine, given an object pose and the camera view matrix. Essentially, these rendered point clouds are our collection of samples.

The initial samples are determined by the output of the first stage. Recall from Section IV-A that our CNN produces a density pyramid, which is a list of bounding boxes with confidence scores. We perform importance sampling over the confidence scores to initialize our samples, so a bounding box with a higher confidence score gets more samples to search over.
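A minimal sketch of this initialization, assuming the first stage output is a list of `(x1, y1, x2, y2, score)` tuples; the function name `init_samples` and the example boxes are hypothetical.

```python
import numpy as np

def init_samples(boxes, num_samples, rng):
    """Draw box indices with probability proportional to their confidence scores."""
    scores = np.array([b[4] for b in boxes], dtype=float)
    probs = scores / scores.sum()
    return rng.choice(len(boxes), size=num_samples, p=probs)

rng = np.random.default_rng(0)
boxes = [(0, 0, 50, 50, 0.05), (10, 10, 80, 90, 0.80), (30, 5, 70, 60, 0.15)]
idx = init_samples(boxes, num_samples=1000, rng=rng)
counts = np.bincount(idx, minlength=len(boxes))
print(counts)  # number of samples assigned to each box
```

Low-confidence boxes still receive a few samples, which is what lets the second stage recover from a detection that the first stage scored poorly.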

To evaluate a sample state, we first crop the depth image with the corresponding bounding box and then back-project it into an observed point cloud. Note that the observed point cloud can differ between samples, as they are associated with different bounding boxes. We measure the “similarity” between the rendered point cloud and the observed point cloud by counting how many of their points match.

First, we define an inlier function over a pair of points, one from the observed point cloud and one from the rendered point cloud: if the Euclidean distance between the two points is within a threshold set by the sensor resolution, the pair counts as an inlier and the total number of inliers increases by 1. The number of inliers is then obtained by summing the inlier function over the 2D indices of the observed point cloud.

Next, the weight of each hypothesis is defined in terms of the number of inliers, the number of points in the observed point cloud, the number of points in the rendered point cloud, and three coefficients. The first term in Equation 9 weighs how well the rendered point cloud matches the observed point cloud within the bounding box. However, since the bounding boxes from Section IV-A are not perfect and usually truncate the objects, the second term accommodates this by weighing how well the current hypothesis explains itself not only in the bounding box but in the scene. We further blend in the object confidence score from the previous stage to balance between the two stages.
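The evaluation of a single sample can be sketched as below. This assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, point clouds organized on the same pixel grid, and a weight of the form `alpha * inliers / n_obs + beta * inliers / n_ren + gamma * confidence`. That form is one reading of the description above, and the coefficient values and the threshold `eps` are placeholders.

```python
import numpy as np

def back_project(depth, box, fx, fy, cx, cy):
    """Crop a depth image (metres) to box = (x1, y1, x2, y2) and return an (h, w, 3) cloud."""
    x1, y1, x2, y2 = box
    crop = depth[y1:y2, x1:x2]
    v, u = np.mgrid[y1:y2, x1:x2]  # pixel rows and columns of the crop
    x = (u - cx) * crop / fx
    y = (v - cy) * crop / fy
    return np.stack([x, y, crop], axis=-1)

def count_inliers(observed, rendered, eps):
    """Count pixels whose observed and rendered points lie within eps of each other."""
    valid = (observed[..., 2] > 0) & (rendered[..., 2] > 0)
    dist = np.linalg.norm(observed - rendered, axis=-1)
    return int(np.sum(valid & (dist < eps)))

def sample_weight(observed, rendered, confidence, eps=0.005,
                  alpha=0.4, beta=0.4, gamma=0.2):
    """Blend two inlier ratios with the first-stage confidence (assumed form)."""
    inliers = count_inliers(observed, rendered, eps)
    n_obs = max(int(np.sum(observed[..., 2] > 0)), 1)
    n_ren = max(int(np.sum(rendered[..., 2] > 0)), 1)
    return alpha * inliers / n_obs + beta * inliers / n_ren + gamma * confidence

depth = np.full((4, 4), 1.0)
observed = back_project(depth, (0, 0, 4, 4), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
rendered = observed.copy()
rendered[:2] += 0.05  # displace half of the rendered points beyond eps
weight = sample_weight(observed, rendered, confidence=0.9)
print(round(weight, 3))  # 0.58
```

In this simplified version the rendered cloud is compared on the cropped grid only; counting rendered points over the whole scene, as the second term intends, would require rendering at full image resolution.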

To obtain the optimum pose, we follow the procedure of importance sampling to assign a new weight to each sample. During the re-sampling process, each pose is also perturbed by noise drawn from a normal distribution in the six DoF space. Once the average weight is above a threshold, we consider the local search to have converged, and the sample with the maximum weight best approximates the true object pose.
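The re-sampling loop can be sketched as follows. The weight function here is a toy stand-in (a Gaussian kernel around a known pose) rather than the rendering-based weight, and `sigma`, `threshold`, and `max_iters` are hypothetical values chosen for illustration.

```python
import numpy as np

def local_search(poses, weight_fn, rng, sigma=0.02, threshold=0.9, max_iters=50):
    """Bootstrap-filter style search over (N, 6) pose samples.

    weight_fn maps an (N, 6) array to N non-negative weights, higher is better.
    Stops once the average weight exceeds `threshold` or after `max_iters`.
    """
    for _ in range(max_iters):
        weights = weight_fn(poses)
        if weights.mean() > threshold:
            break
        probs = weights / weights.sum()
        # Importance re-sampling: good hypotheses are duplicated, poor ones dropped.
        idx = rng.choice(len(poses), size=len(poses), p=probs)
        # Diffuse the survivors with Gaussian noise in the six DoF space.
        poses = poses[idx] + rng.normal(scale=sigma, size=poses.shape)
    weights = weight_fn(poses)
    return poses[np.argmax(weights)], float(weights.mean())

rng = np.random.default_rng(0)
true_pose = np.array([0.5, -0.2, 0.8, 0.1, 0.0, 1.2])  # x, y, z, roll, pitch, yaw

def toy_weight(poses):
    # Stand-in for the point cloud weight: closer to true_pose scores higher.
    return np.exp(-np.sum((poses - true_pose) ** 2, axis=1) / 0.08)

initial = true_pose + rng.normal(scale=0.3, size=(500, 6))
best, avg_weight = local_search(initial, toy_weight, rng, threshold=0.7)
print(np.linalg.norm(best - true_pose))  # distance of the best sample to the true pose
```

Adding Gaussian noise to Euler angles is a simplification; a real implementation would need to handle angle wrap-around or perturb orientations in another parameterization.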

V Implementation

We use PyTorch for our CNN implementation. The aspect ratio branch predicts a discrete set of aspect ratios. Each training image contains only one object. The aspect ratio of an object in a training image can be inferred from the width and height of the object. We ignore the aspect ratio labels of background images during training. Both branches of our CNN are treated as classifiers, since we consider predicting the aspect ratio as a classification task as well. Thus, the loss function in the training phase is cross entropy.

Our second stage local search method relies on the OpenGL graphics engine to render depth images, given a 3D geometry model and a camera view matrix. During the local search process, we allocate a fixed number of samples for each iteration and perform a fixed number of iterations in total. After the final iteration, we select the sample with the highest weight and consider it to be the estimated object pose.

Following pose estimation, we further generate grasp poses for the object based on the method in [26]. To perform the categorical sorting task illustrated in Figure 2, we use a Fetch robot for object grasping and placement. The Fetch robot is a mobile manipulator with a seven DoF arm and a pan-tilt head equipped with an RGB-D sensor.

VI Experiments

Fig. 5: Tabletop object dataset. We have 15 objects total. For each scene, we randomly place a set of objects in the scene.

VI-A Dataset

Our dataset contains images of the 15 tabletop objects, collected in a cluttered tabletop environment as shown in Figure 5. All objects are labelled with 2D bounding boxes and six DoF poses. We use part of the images for training and parameter tuning and 28 images for testing.

VI-B Results

We first present the result and analysis on pose estimation accuracy, followed by the robot categorical sorting tasks.

VI-B1 Pose Estimation

In the pose estimation experiments, we evaluated the performance of the SAND filter on the 28 test images in the dataset. We employ the evaluation metric from [11], which measures the mean pairwise point distance between the ground truth point cloud and the point cloud rendered from the estimated 6D pose. This metric can also account for objects with an axis of symmetry, which is useful when comparing household objects.
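A metric of this kind can be sketched as below: one function averages the distance between corresponding model points under the two poses, and a second matches each ground truth point to its nearest estimated point so that symmetric objects are not penalized. The function names and the toy square model are assumptions for illustration, and the exact definition in [11] may differ in detail.

```python
import numpy as np

def transform(points, rotation, translation):
    """Apply a rigid transform to an (N, 3) array of model points."""
    return points @ rotation.T + translation

def mean_pairwise_distance(model, r_gt, t_gt, r_est, t_est):
    """Average distance between corresponding model points under the two poses."""
    gt = transform(model, r_gt, t_gt)
    est = transform(model, r_est, t_est)
    return float(np.mean(np.linalg.norm(gt - est, axis=1)))

def mean_closest_point_distance(model, r_gt, t_gt, r_est, t_est):
    """Symmetry-aware variant: match each ground truth point to its nearest estimated point."""
    gt = transform(model, r_gt, t_gt)
    est = transform(model, r_est, t_est)
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(np.mean(dists.min(axis=1)))

# A square in the xy-plane is unchanged by a 90 degree rotation about z.
model = np.array([[1, 1, 0], [-1, 1, 0], [-1, -1, 0], [1, -1, 0]], dtype=float)
identity, zero = np.eye(3), np.zeros(3)
rot90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

plain = mean_pairwise_distance(model, identity, zero, rot90, zero)
symmetric = mean_closest_point_distance(model, identity, zero, rot90, zero)
print(plain, symmetric)  # 2.0 0.0
```

The example shows why the symmetric variant matters: the rotated square looks identical, yet the corresponding-point distance reports a large error.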

We first compare the SAND filter with a baseline that uses the state-of-the-art object detector Faster R-CNN [19] as the first stage and the widely used pose estimation algorithm ICP [1] as the second stage. As shown in Figure 6, the SAND filter is consistently better than the baseline over all thresholds. For a tight distance threshold (e.g., 0.02 m), our approach achieves over sixty percent accuracy, which greatly outperforms the baseline. This is because our second stage can iteratively narrow down the search space through sampling-based local search. Note that for the baseline, the ICP orientation for each object in the scene is initialized to zero, which biases the result towards specific object configurations (e.g., objects standing upright on the table).
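An accuracy-threshold curve of this kind can be computed as in the following sketch, where a pose counts as correct at a given threshold if its distance to the ground truth is below that threshold. The distances below are made-up values for illustration, not results from our experiments.

```python
import numpy as np

def accuracy_threshold_curve(distances, thresholds):
    """Fraction of pose estimates whose distance to ground truth is below each threshold."""
    distances = np.asarray(distances, dtype=float)
    return [float(np.mean(distances < t)) for t in thresholds]

example_distances = [0.005, 0.012, 0.018, 0.03, 0.07]  # metres, made-up values
curve = accuracy_threshold_curve(example_distances, thresholds=[0.01, 0.02, 0.05, 0.1])
print(curve)  # [0.2, 0.6, 0.8, 1.0]
```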

We then replace the generative sampling-based local search in the second stage of the SAND filter with ICP and with Fast Point Feature Histograms (FPFH) [20], while keeping the first stage the same. The plots in Figure 7 show that our second stage method is significantly better than ICP and FPFH when the pyramid of heatmap detections is used as the first stage.

Fig. 6: Pose estimation accuracy-threshold curves for our SAND filter approach and the baseline approach. The x-axis is the average distance to the ground truth object and the y-axis is the pose accuracy based on the metric.

VI-B2 Categorical Sorting

In the manipulation experiments, the task for the robot is to sort all the objects on a tabletop into different locations on the shelf according to their categories (left column of Figure 2). The robot first runs the two-stage method to acquire the objects' poses, then grasps the objects and places them at the corresponding locations on the shelf one at a time. We divide the 15 objects into three categories: food & drink, laundry, and miscellaneous. Pose estimation is performed again after each object is placed. This process is illustrated in Figure 8.

The robot performs 9 sorting sequences in total. Because task completion depends on many factors, such as motion planning and grasping, we split the task into three sub-tasks: pose estimation, grasping, and placement. This lets us analyze the success rate of each sub-task and determine the bottleneck of the entire system.

Table I shows the success rate for each sub-task in our categorical sorting tasks. Each task contains multiple objects. For pose estimation, we consider it a failure if all the detections fall below a certain threshold. After successful pose estimation, we further inspect object grasping given the matched poses. As for the placement task, as long as the robot is able to place the object in its corresponding area on the shelf, we consider it a success. Since these three sub-tasks are sequential, we consider the entire task completed if the placement is successful.

Fig. 7: Pose estimation accuracy-threshold curves comparing different second stage methods for the SAND filter approach. The x-axis is the average distance to the ground truth object and the y-axis is the pose accuracy based on the metric.

                Pose Estimation    Grasping         Task Completion
Success Rate    40/42 (0.952)      37/40 (0.925)    37/42 (0.881)

TABLE I: Results of the categorical sorting experiments for 9 sequences.

According to Table I, the main source of failure is grasping, since three out of the five failed cases are due to grasping. The end-effector of the Fetch robot is a hard gripper without any tactile sensors. Therefore, the robot essentially performs open-loop grasping on top of the cluttered tabletop without any feedback. In the failed pose estimation cases, our detector fails to locate the object in the scene, which leads to a failed pose match. However, considering the challenging nature of our scenes and tasks, the overall success rate is promising.
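The rates in Table I and the failure breakdown quoted above can be checked with a few lines of arithmetic. The counts are taken from the table; the variable names are only for illustration.

```python
from fractions import Fraction

# Counts taken from Table I.
pose_ok, pose_total = 40, 42
grasp_ok, grasp_total = 37, 40
done_ok, done_total = 37, 42

rates = {
    "pose estimation": Fraction(pose_ok, pose_total),
    "grasping": Fraction(grasp_ok, grasp_total),
    "task completion": Fraction(done_ok, done_total),
}
for name, rate in rates.items():
    print(f"{name}: {float(rate):.3f}")

# Failures attributed to each stage.
pose_failures = pose_total - pose_ok      # 2
grasp_failures = grasp_total - grasp_ok   # 3
total_failures = done_total - done_ok     # 5
```

Because the sub-tasks are sequential, the task completion rate equals the product of the pose estimation and grasping rates here, which implies that every successfully grasped object was also placed.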

Fig. 8: The robot executes categorical sorting tasks based on SAND Filter in one action sequence.

VII Conclusion and Discussion

In this work, we first present a novel two-stage paradigm, SAND filter, that leverages both CNN and generative sampling-based local search to achieve accurate object detection and six DoF object poses. We further build a manipulation pipeline to perform categorical object sorting tasks.

Performing object manipulation tasks requires accurate object detection and pose estimation. Our SAND filter enables the robot to perceive the environment despite occlusion and clutter. We hope our two-stage paradigm sheds light on the challenging nature of perception for manipulation tasks and furthers progress towards truly autonomous manipulation.

In the future, we will enable the SAND filter to detect and estimate poses for multiple object instances. In addition, the current SAND filter only works on a single observation, and it can be extended to account for sequential observations.