To make autonomous robots taskable so that they function properly, meet human expectations, and interact fluently with human partners, they must be able to perceive and understand the semantics of their environments [laird2017interactive]. More specifically, as robots move around in the environment, they must know what objects are present as well as their locations. Robots should understand the semantic aspects of the scene and build a map at the object level. An object-oriented representation of the world is a natural and efficient way for robots to make high-level decisions using task-level planners, and it helps them communicate with human users.
The challenge is that many semantic aspects of the world are difficult for robots to sense directly due to the limited field-of-view and noisy observations of their onboard sensors. Great progress has been achieved by advances in deep neural networks for object detection [ren2015faster, liu2016ssd, he2017mask] and 6D object pose estimation [xiang2017posecnn, tremblay2018deep, wang2019densefusion]. However, building a semantic map at the object level remains a challenging problem, especially in densely cluttered environments such as the one shown in Fig. LABEL:fig:teaser, where objects are often in close contact, causing severe or complete occlusions.
One promising approach for building object-level maps is to fuse semantic measurements from different viewpoints. Robots can take advantage of their ability to move around the environment and provide continuous observations. More specifically, the data association between measurements and objects is first determined [wong2015data], and probabilistic models are built over objects and robot poses to infer the semantic map (SLAM++ [salas2013slam++], Fusion++ [mccormac2018fusion++]) or a belief over objects (CT-MAP [zeng2018semantic]). Such methods perform inference directly on object instances instead of low-level primitives (points, surfels). This approach offers faster inference, a more compact map representation, and the potential for dynamic object reasoning. We posit that it is particularly advantageous in dense clutter, where the geometric properties of object instances and the geometric relationships between objects can be used for more robust data association and more accurate pose estimation.
In this work, we present GeoFusion, a SLAM-based approach for inferring object labels and 6D poses from continuous noisy measurements in dense clutter by exploiting the geometric consistency of object instances. Geometric consistency has two parts: a geometric consistency score and geometric relations. The geometric consistency score describes the compatibility between an object geometry model and the observed point cloud, and it provides a reliable measure for filtering out false positives in data association. A geometric relation represents the relationship (e.g., contact) between geometric features (e.g., planes) of objects. Moreover, geometric relations directly support better decision making by high-level task planners [PDDL:1998, asp].
Given noisy measurements from state-of-the-art object detection [he2017mask] and pose estimation [wang2019densefusion] systems, our method first determines the correct correspondences between measurements and objects using the geometric consistency between measurements and point cloud observations. The associations are then used to build a factor graph along with the geometric relations between objects, and a fast optimization technique yields the maximum likelihood estimate of the object and robot poses. We quantitatively evaluate our method and demonstrate that GeoFusion estimates geometrically stable scenes, is robust against false estimates from a state-of-the-art frame-by-frame estimation system [wang2019densefusion], and outperforms baseline methods that do not consider geometric consistency.
II Related Work
II-A Semantic Mapping
Since the emergence of semantic mapping [kuipers2000spatial], many works have explored this field with various semantic representations [kostavelis2015semantic]. With a focus on object-level semantic maps, one widely used approach is to reconstruct a 3D map from either sparse features [murAcceptedTRO2015] or dense point clouds [newcombe2011kinectfusion, whelan2015elasticfusion] and then augment the map with objects. Civera et al. [civera2011towards] and Ekvall et al. [ekvall2006integrating] used SURF/SIFT descriptors to register objects to a map created in a parallel thread; however, their work cannot deal with objects in clutter. To deal with clutter, Li et al. [li2016incremental] proposed an incremental segmentation approach that fuses segmentations across different frames on a densely reconstructed 3D map and used ObjRecRANSAC [papazov2010efficient] to register object poses, but their method does not scale well with substantial false detections.
Salas-Moreno et al. [salas2013slam++] proposed an object SLAM system that recognizes objects using Point-Pair Features in each frame and directly performs graph optimization on object and camera 6D poses. Their work shows promising results in the direction of building maps with 6D object poses. McCormac et al. [mccormac2018fusion++] constructed an online object-level SLAM system with previously unknown object shapes and performed pose graph optimization at the object level. Both works assume independence between objects and hence have difficulty in densely cluttered environments. Zeng et al. [zeng2018semantic] proposed CT-MAP, which considers contextual relations between objects and the temporal consistency of object poses in a conditional random field, and maintains a belief over object classes and poses. This approach is robust to false detections; however, it does not scale well with an increasing number of objects due to the nature of purely generative inference.
II-B Scene Estimation
Great progress has been made by deep neural networks in object detection [ren2015faster, he2017mask], 6D object pose estimation [xiang2017posecnn, wang2019densefusion], and semantic segmentation [long2015fully, papandreou2015weakly]. To be robust to the false detections caused by overfitting, discriminative-generative algorithms [joho2013nonparametric, narayanan2016discriminatively, sui2017sum] have been proposed for robust perception, especially in adversarial environments [chen2019grip]. These efforts combine inference by deep neural networks with sampling and probabilistic inference models to achieve robust and adaptive perception. However, the robustness comes at the cost of computation time and an assumption of object independence.
Sui et al. [sui2017goal] proposed an axiomatic scene estimation method that estimates both the geometric spatial relationships between objects and their 3D object poses for goal-directed manipulation. The spatial relations pose strong constraints that reduce the search space of object poses. Desingh et al. [desingh2016physically] and Mitash et al. [mitash2019physics] leverage physics in scene estimation, searching object poses based on a compatibility score from physics simulation. The above-mentioned methods take geometric or physical constraints into account when searching object poses, at the cost of computational efficiency.
III Problem Formulation
Our aim is to estimate the semantic map composed of a collection of static objects $O = \{o_1, \dots, o_N\}$ with known 3D object geometries, as well as their geometric relations $R$. Each object $o_i$ contains an object class and a 6D object pose. Each relation in $R$ describes the geometric spatial relation between geometric features of two objects (e.g., support).
As the robot moves around in the environment, it observes a set of measurements $Z = \{z_t^k\}$, where $t = 1, \dots, T$ indexes the time steps the robot has travelled and $k = 1, \dots, K_t$ indexes the measurements at each time step. Similar to an object, each semantic measurement $z_t^k$ comprises an object class and a 6D pose. The robot poses are represented as $X = \{x_1, \dots, x_T\}$, where each $x_t$ is also a 6D pose in our case. In addition, the correspondences between objects and measurements must also be determined; they are defined as $D = \{d_t^k\}$, where $d_t^k = i$ stipulates that measurement $z_t^k$ corresponds to object $o_i$.
A complete statement of this problem is the maximum likelihood estimation of the objects $O$, robot poses $X$, and data association $D$ given the measurements $Z$ and the geometric relations $R$:

$\hat{O}, \hat{X}, \hat{D} = \operatorname{arg\,max}_{O, X, D} P(O, X, D \mid Z, R)$  (1)

where $R$ is computed heuristically from the estimated objects $\hat{O}$.
As the joint estimation of objects, data association, and robot poses suffers from high dimensionality, we decompose it into two separate estimation problems. The maximum likelihood data association $\hat{D}$ is first computed given initial estimates of the robot poses $\hat{X}^0$ and objects $\hat{O}^0$:

$\hat{D} = \operatorname{arg\,max}_{D} P(D \mid \hat{X}^0, \hat{O}^0, Z)$  (2)

Then, given the computed $\hat{D}$, the most likely robot poses and objects are estimated:

$\hat{X}, \hat{O} = \operatorname{arg\,max}_{X, O} P(X, O \mid \hat{D}, Z, R)$  (3)
Again, we decompose Eq. 3 into a two-step optimization: the object and robot poses are first optimized; the geometric relations $R$ are then computed heuristically from the optimized objects $\hat{O}$; and these geometric relations, in turn, pose constraints on the factor graph for estimating a geometrically stable scene.
IV Data Association
As the number of objects is not known as prior knowledge, we propose a non-parametric clustering approach for data association based on the DP-means algorithm [kulis2011revisiting], as shown in Algorithm 1. Given initial estimates of objects $\hat{O}^0$, robot poses $\hat{X}^0$, and measurements $Z$, we compute how likely each measurement is to belong to a current object, to be a new object, or to be a false positive. Objects are created, updated, and deleted dynamically as the data association process moves forward.
For each measurement $z_t^k$ in frame $t$, the likelihood of being assigned to object $o_i$ is calculated as the product of a label factor, a geometric consistency factor, and a pose factor:
where the label factor is obtained using an indicator function that evaluates to 1 if the measurement and object classes are equal and 0 otherwise, meaning that each measurement will only be assigned to an object with the same label.
The second term captures the geometric consistency score between the measurement and the observation point cloud and provides a measure of how reliable the given measurement is. The motivation behind this term is that the confidence scores given by neural networks are sometimes not accurate enough to describe the reliability of a measurement. Given a measurement, a point cloud is back-projected from the rendered depth image and compared with the observation point cloud to compute a projective inlier ratio, outlier ratio, and occlusion ratio. A point in the rendered point cloud is first considered an occlusion point if it is occluded by the projective point in the observation point cloud along the camera ray. Otherwise, it is considered an inlier if its distance to the projective point is less than a sensor resolution threshold, or an outlier if the distance is greater than that threshold. Fig. 1 shows an example of the different types of points. The ratios are computed over the total number of rendered points and are used as follows:
The modified sigmoid function provides an S-shaped logistic curve that is well-adapted to reflect the changing tendency of confidence in the given measurement over different values of each ratio.
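A minimal sketch of this score follows. The per-point classification mirrors the description above; the sigmoid parameters and the way the inlier and outlier ratios are combined are illustrative assumptions, not the paper's exact formula.

```python
import math

def modified_sigmoid(x, center, slope):
    """S-shaped logistic curve mapping a ratio in [0, 1] to a confidence."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))

def geometric_consistency_score(rendered_pts, eps=0.01):
    """Classify each rendered point as inlier, outlier, or occluded by
    comparing rendered vs. observed depth along the same camera ray,
    then map the ratios through sigmoids.

    rendered_pts: list of (depth_rendered, depth_observed) pairs per ray.
    eps: sensor resolution threshold (hypothetical value).
    """
    inlier = outlier = occluded = 0
    for d_rend, d_obs in rendered_pts:
        if d_obs < d_rend - eps:
            occluded += 1          # observed surface blocks the rendered point
        elif abs(d_rend - d_obs) <= eps:
            inlier += 1
        else:
            outlier += 1
    n = max(len(rendered_pts), 1)
    r_in, r_out = inlier / n, outlier / n
    # High inlier ratio raises confidence, high outlier ratio lowers it
    # (this combination is illustrative only).
    return modified_sigmoid(r_in, 0.5, 10.0) * (1.0 - modified_sigmoid(r_out, 0.5, 10.0))
```

Occluded points are deliberately neutral: a heavily occluded object should be neither rewarded nor punished for the parts the camera cannot see.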
The last term is the pose factor, which reflects the similarity between the measurement pose and the object pose given the current robot pose. Here, we make the assumption that the likelihood of each measurement given the robot pose and the object pose follows a multivariate Gaussian distribution.
The measurement noise matrix parameterizes this Gaussian, and the pose factor can thus be drawn from the Gaussian probability density function. With all the factors described above, a measurement is assigned to the maximum likelihood object if the corresponding likelihood exceeds a threshold; otherwise, it is assigned to a new object.
The One Measurement Per Object (OMPO) constraint [wong2015data] is also applied in this process so that two measurements in the same frame are never assigned to the same object; the latter measurement is assigned to a new object instead.
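The association rule above can be sketched as follows, in the spirit of the DP-means-style procedure; the likelihood combination, threshold value, and data layout are illustrative assumptions rather than the paper's exact formulation.

```python
def associate(measurements, objects, geo_score, pose_likelihood,
              new_obj_threshold=0.1):
    """Greedy per-frame data association: assign each measurement to its
    maximum-likelihood object, or spawn a new object when no likelihood
    exceeds the threshold. Enforces the One Measurement Per Object (OMPO)
    constraint within the frame.

    measurements: list of dicts with 'label' and 'pose'.
    objects: list of dicts with 'label' and 'pose' (new objects appended).
    geo_score(z): geometric consistency score of a measurement.
    pose_likelihood(z_pose, o_pose): Gaussian pose factor.
    """
    used = set()       # objects already claimed in this frame (OMPO)
    assignment = []
    for z in measurements:
        best_i, best_l = None, 0.0
        for i, o in enumerate(objects):
            if i in used or o['label'] != z['label']:
                continue  # label factor: indicator on matching classes
            l = geo_score(z) * pose_likelihood(z['pose'], o['pose'])
            if l > best_l:
                best_i, best_l = i, l
        if best_i is None or best_l < new_obj_threshold:
            objects.append({'label': z['label'], 'pose': z['pose']})
            best_i = len(objects) - 1
        used.add(best_i)
        assignment.append(best_i)
    return assignment
```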
IV-2 False Positive Removal
False positive score
For each object candidate created or updated in the association step, we compute its false positive score as follows:
where the score depends on the number of measurements assigned to the object and the maximum geometric consistency score among all of those measurements. If more measurements are assigned to an object, it is reasonable to consider that object more reliable than others, and thus it deserves a lower false-positive score. All objects with false-positive scores higher than a threshold are then deleted.
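One plausible form of such a score is sketched below; the exact expression is a hypothetical stand-in, chosen only to exhibit the two properties just described (more supporting measurements and a higher best geometric consistency score both lower the score).

```python
def false_positive_score(num_measurements, best_geo_score):
    """Hypothetical false-positive score for an object candidate.
    num_measurements: count of measurements associated to the object (>= 1).
    best_geo_score: maximum geometric consistency score among them, in [0, 1].
    Objects whose score exceeds a threshold are deleted."""
    assert num_measurements >= 1
    return (1.0 - best_geo_score) / num_measurements
```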
Overlapped objects merging
We notice that most false positive measurements overlap with other objects, especially for symmetric objects such as a cylinder-shaped tomato soup can. Object-level geometric consistency is incorporated again to address this problem: we apply the Separating Axis Theorem (SAT) [gottschalk1996obbtree] to detect the collision ratio of the three-dimensional bounding boxes of any two objects. Corner points of each bounding box are projected onto several principal axes, and the overall collision ratio is decided based on the percentage of overlap along each axis. An object with a high collision ratio is merged into the object with the lower false-positive score. A Non-Maximum Suppression (NMS) algorithm [neubeck2006efficient] is also adopted to ensure we find the best candidate when multiple objects overlap with each other.
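For axis-aligned boxes, the per-axis overlap percentage reduces to 1D interval arithmetic, sketched below. Note the paper projects oriented bounding-box corners onto the boxes' own principal axes; this simplified version assumes axis-aligned boxes.

```python
def interval_overlap_ratio(a_min, a_max, b_min, b_max):
    """Fraction of the smaller interval covered by the overlap."""
    overlap = min(a_max, b_max) - max(a_min, b_min)
    if overlap <= 0:
        return 0.0
    return overlap / min(a_max - a_min, b_max - b_min)

def box_collision_ratio(box_a, box_b):
    """Collision ratio of two axis-aligned 3D boxes, each given as
    (min_corner, max_corner). The overall ratio is the minimum of the
    per-axis overlap percentages: zero whenever any axis separates the
    boxes, as in a SAT test."""
    ratios = [interval_overlap_ratio(box_a[0][k], box_a[1][k],
                                     box_b[0][k], box_b[1][k])
              for k in range(3)]
    return min(ratios)
```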
V Graph Optimization
As shown in Fig. 2, our optimization process is decomposed into two stages. First, the 6D poses of the robot and the objects are optimized over odometry and measurements. In the second stage, the 6D poses of the objects are fine-tuned over the inferred geometric constraints to generate a geometrically consistent scene estimate. To simplify the notation, a 6D pose is represented in two ways: as a transformation matrix or as a combination of a quaternion and a translation. Denote by $q(\cdot)$ the mapping from a transformation matrix to its corresponding quaternion representation.
V-A Stage I Optimization
V-A1 Odometry Cost
Odometry measurements are provided by ORB-SLAM2 [murAcceptedTRO2015]. The odometry cost describes the difference between the estimated odometry and the measured odometry. Given the estimated robot poses and the measured odometry, we optimize over the robot poses to minimize the following odometry cost:
where $\lVert e \rVert_{\Sigma}$ is a diagonally weighted norm computed as $e^{\top} \Sigma\, e$. $\Sigma$ is a diagonal matrix whose entries are the odometry translation weight for the translation components and the odometry rotation weight for the rotation components. Subsequent appearances of such weights denote the weight of each individual equation unless explicitly explained.
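For concreteness, a standard pose-graph odometry cost of this form can be sketched as follows, where $x_t$ denotes the estimated robot pose at time $t$ as a transformation matrix, $\hat{T}_{t,t+1}$ the measured odometry between consecutive keyframes, $\mathrm{vec}(q(\cdot))$ the vector part of the quaternion of a transform, and $\mathrm{trans}(\cdot)$ its translation. This is a sketch under assumed notation, not necessarily the paper's exact parameterization:

```latex
E_{\mathrm{odom}} \;=\; \sum_{t=1}^{T-1}
\left\lVert
\begin{bmatrix}
\mathrm{trans}\!\left(\hat{T}_{t,t+1}^{-1}\, x_t^{-1}\, x_{t+1}\right) \\[2pt]
\mathrm{vec}\!\left(q\!\left(\hat{T}_{t,t+1}^{-1}\, x_t^{-1}\, x_{t+1}\right)\right)
\end{bmatrix}
\right\rVert_{\Sigma}
```

The residual vanishes exactly when the relative transform $x_t^{-1} x_{t+1}$ between consecutive estimated poses equals the measured odometry.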
V-A2 Measurement Cost
The measurement cost constrains the discrepancy between object measurements and estimated object states. As mentioned before, neural networks often produce more false positives for symmetric objects due to their symmetries, so we design separate cost functions for symmetric and asymmetric objects.
Given the robot pose and a measurement of object $o_i$, we can obtain an estimate of the object pose at time step $t$. We aim to optimize over the robot pose and object pose to minimize the difference between the measured and estimated object poses. The relative transform between the measurement and the estimate is defined as follows:
For an asymmetric object, the cost function is as follows:
Here the quaternion and translation components of the relative transform are penalized with their respective weights.
As for a symmetric object with its rotation axis along the z-axis of the object frame, the cost function is as follows:
where the last term of the rotation residual is set to zero to exclude the influence of rotation about the z-axis. For objects whose rotation axis lies along a different axis, the cost function can be derived in a similar way.
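A minimal sketch of such a symmetry-aware rotation residual, assuming unit quaternions in (w, x, y, z) order; the exact residual in the paper may differ:

```python
def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def quat_conj(q):
    """Conjugate (inverse for unit quaternions)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def symmetric_rotation_residual(q_meas, q_est):
    """Rotation residual for an object symmetric about its z-axis:
    compute the relative rotation between measurement and estimate and
    drop the z component, so rotations about the symmetry axis are free."""
    dq = quat_mul(quat_conj(q_est), q_meas)
    return (dq[1], dq[2])  # x and y components only; z term set to zero
```

A measurement that differs from the estimate only by a spin about the symmetry axis therefore contributes zero residual, instead of being penalized as an error.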
V-B Object Geometric Relationship Inference
In the real world, geometric relationships between objects can be quite complicated. One object can support another, be parallel to it, or be perpendicular to it. If we want to use geometric relationships to optimize object poses, we have to be very conservative about which relationships we infer: a wrong inference can be disastrous for the entire optimization.
For this reason, contact is the only geometric relationship we choose for optimization. We believe contact is the most common and reliable geometric relationship between objects in real scenes, since objects in real scenes must rest on something and thus contact other objects.
V-B1 Geometric Surface Feature
Surface geometric features are the key to determining the existence of contact between different objects. To model object geometry as generally as possible, instead of modeling an object as a particular 3D shape such as a cube or cylinder, a collection of surface features is used to represent the 3D geometry of an object. In this way, an arbitrary convex polyhedron can easily be handled, since it can be represented by a set of plane surface features.
In this paper, geometric surface features are divided into two classes: plane surface features and curved surface features. An illustration of these surface features is shown in Fig. 3.
Plane Surface Feature
A plane surface feature refers to a plane on a polyhedron. Three attributes are bound to a plane surface feature: a center point, describing the 3D position of the plane center in the object's local frame; boundary points, each giving the 3D position of a boundary point of the surface; and a normal direction, the normal of the plane feature in the object's local frame.
Curved Surface Feature
A curved surface feature describes the curved surface of a particular 3D shape, such as a cylinder. Notably, the edges of polyhedrons are also regarded as curved surface features with a radius equal to 0. Five attributes are associated with a curved surface feature: the 3D position of the center and the direction of the rotation axis, which together determine the rotation axis of the curved surface; the 3D positions of the boundary points along the rotation axis in the object's local frame; and the radius of the curved surface.
Based on the above surface features, we obtain three different types of contact relationships between features: plane-plane (P2P) contact, plane-curved surface (P2C) contact, and curved surface-curved surface (C2C) contact. Fig. 4 illustrates these geometric relations graphically.
V-B2 Geometric Feature Relationship Inference
If two plane features from two different objects are contacting each other, their attributes should satisfy the following rules:
where the two thresholds are the direction threshold and the distance threshold for P2P contact.
If a plane feature and a curved surface feature are contacting each other, their attributes should satisfy the following rules:
where the two thresholds are the direction threshold and the distance threshold for P2C contact.
If two curved surface features are contacting each other, their attributes should satisfy the following rules:
where the threshold is the distance threshold for C2C contact.
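The P2P rule above can be sketched with a simple plane-feature record: two planes are in contact when their normals are roughly anti-parallel and the separation along the contact normal is small. The threshold values and the exact form of each test are illustrative assumptions.

```python
import math

class PlaneFeature:
    """Plane surface feature: center point and unit normal, here assumed
    already transformed into the world frame."""
    def __init__(self, center, normal):
        self.center = center
        self.normal = normal

def p2p_contact(f1, f2, dir_thresh_deg=10.0, dist_thresh=0.01):
    """P2P contact check: normals must be nearly anti-parallel (direction
    threshold) and the centers must lie nearly in a common plane along the
    contact normal (distance threshold)."""
    dot = sum(a * b for a, b in zip(f1.normal, f2.normal))
    if dot > -math.cos(math.radians(dir_thresh_deg)):
        return False  # normals not anti-parallel enough
    diff = [c2 - c1 for c1, c2 in zip(f1.center, f2.center)]
    dist = abs(sum(d * n for d, n in zip(diff, f1.normal)))
    return dist <= dist_thresh  # separation along the contact normal
```

The P2C and C2C checks follow the same pattern, replacing the plane normal with the curved surface's rotation axis and comparing the separation against the radius.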
V-B3 Physical Relationship Check
Apart from using geometric methods to determine the contact relationships between objects, physics-based checking is also used in geometric relationship inference.
Using the geometry-based method, we can find many contact feature pair candidates in a scene. However, not all candidates can be used in the optimization, because many of the inferred contacts are false. For example, if two cubes are placed parallel and close to each other on a table, we may detect a P2P contact between them, but there is a chance that the objects are merely near each other and not actually in contact. To avoid such false contact judgments, we add two physical checks for the contact feature pair candidates. The main idea is that we only consider contact relationships that result from gravity; in other words, support relationships.
Support direction check
If a plane feature is to support another plane or be supported, its normal direction cannot be horizontal. Assume the direction of gravity is known. The support direction check of a plane feature is as follows:
where the threshold bounds how far the plane normal may deviate from the gravity direction in the support direction check.
Qualitative Support projection check
If two features have a support relationship, their 2D projections along the direction of gravity must overlap. For the contact feature candidates, the 2D projection is the polygon or line segment projected from their boundary points. The Separating Axis Theorem (SAT) is applied to check the overlap status of these 2D figures. Feature candidates whose 2D projections do not overlap are removed.
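The two physical checks can be sketched as follows. The gravity direction, angular threshold, and convex-polygon representation of the projections are assumptions; the SAT routine is the standard convex-polygon overlap test.

```python
import math

GRAVITY = (0.0, 0.0, -1.0)  # assumed gravity direction in the world frame

def supports(normal, max_angle_deg=30.0):
    """Support direction check: a supporting (or supported) plane's normal
    must be roughly parallel or anti-parallel to gravity, i.e. not
    horizontal. The angular threshold is a hypothetical value."""
    dot = sum(n * g for n, g in zip(normal, GRAVITY))
    return abs(dot) >= math.cos(math.radians(max_angle_deg))

def polygons_overlap(poly_a, poly_b):
    """SAT overlap test for two convex 2D polygons (lists of vertices):
    the polygons are disjoint iff some edge normal separates their
    projections."""
    for poly in (poly_a, poly_b):
        n = len(poly)
        for i in range(n):
            # edge normal serves as a candidate separating axis
            ex = poly[(i + 1) % n][0] - poly[i][0]
            ey = poly[(i + 1) % n][1] - poly[i][1]
            ax, ay = -ey, ex
            proj_a = [ax * x + ay * y for x, y in poly_a]
            proj_b = [ax * x + ay * y for x, y in poly_b]
            if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
                return False  # found a separating axis
    return True
```

A candidate contact pair passes only if the supporting feature passes `supports` and the gravity-aligned projections of the two features pass `polygons_overlap`.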
V-C Stage II Optimization
Geometric constraints between two objects are the combination of the geometric constraints between their features; in other words, the P2P, P2C, and C2C relationships.
Given two objects, suppose a P2P contact is detected between their plane features, and denote the relative transforms of those features with respect to the corresponding objects. (The local coordinate frame of a feature is a frame whose origin is located at the feature center and whose z-axis is aligned with the feature normal.) Then we can optimize over the two object poses to minimize the following cost:
Given two objects, if a P2C contact is detected between a plane feature of one and a curved surface feature of the other, we optimize over the two object poses to minimize the following cost:
Given two objects, if a C2C contact is detected between their curved surface features, we optimize over the two object poses to minimize the following cost:
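As an illustration, a P2P contact cost can penalize deviation from anti-parallel normals and nonzero separation along the contact normal. This sketch assumes world-frame plane features already transformed by the current object pose estimates, with hypothetical weights; it is not the paper's exact cost.

```python
def p2p_contact_cost(c1, n1, c2, n2, w_dir=1.0, w_dist=1.0):
    """Penalty for a detected plane-plane contact: zero when the normals
    are exactly anti-parallel and the two centers lie in a common plane.
    c1, n1 / c2, n2: world-frame center and unit normal of each plane.
    w_dir, w_dist: hypothetical weights for the two residual terms."""
    dot = sum(a * b for a, b in zip(n1, n2))
    dir_err = 1.0 + dot                      # 0 when normals are anti-parallel
    diff = [b - a for a, b in zip(c1, c2)]
    dist_err = sum(d * n for d, n in zip(diff, n1)) ** 2
    return w_dir * dir_err ** 2 + w_dist * dist_err
```

Minimizing this cost over the two object poses pulls the contacting surfaces flush against each other, which is what produces the aligned, interpenetration-free scenes after stage II.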
VI Experiments

VI-A Implementation Details

We use Mask R-CNN [he2017mask] and DenseFusion [wang2019densefusion] to generate semantic measurements: the former detects objects along with instance segmentations, which the latter takes in to estimate 6D poses. We fine-tune Mask R-CNN on the YCB-Video Dataset [xiang2017posecnn] starting from the pretrained Microsoft COCO model [lin2014microsoft]. We use the publicly available DenseFusion weights and implementation without fine-tuning. The front-end in our implementation selects every 10th camera frame as a keyframe. The camera visual odometry is provided by ORB-SLAM2 [murAcceptedTRO2015]. We use Ceres [ceres-solver] as the optimization back-end. In our implementation, we perform stage I optimization every 10th keyframe and stage II optimization every keyframe.
VI-B Dataset and Baseline
To test the performance of our method, we collect a testing dataset of 6 RGB-D video streams in which our Michigan Progress Robot moves around a table. In each scene, we place around 18 objects in dense clutter on the table and collect around 200 keyframe images from the robot's sensor. The ground truth object classes and poses are labelled using LabelFusion [marion2018label]. For baselines, we compare our method with frame-by-frame (FbF) estimation from DenseFusion [wang2019densefusion] and with two variations of our method: 1) a baseline SLAM (B-SLAM) method that does not consider geometric consistency in either data association or graph optimization; the confidence score from DenseFusion, instead of the geometric consistency score, is used in computing the data association; and 2) a robust front-end (R-Front) method where only our data association is used, without the graph optimization.
VI-C1 Object Detection
We first compare object detection performance with the baseline methods on each keyframe collected in the dataset. We use the most common object detection metric, mean average precision (mAP). Average precision (AP) is the area under the PR-curve of an individual object class, and mean average precision is the mean of the APs over all classes. The subscript on mAP is the intersection-over-union (IoU) ratio between the estimated bounding box and the ground truth bounding box: mAP_{0.5} counts a detection as a true positive if the IoU is over 0.5, and mAP_{0.75} sets a stricter criterion where a detection is a true positive only if the IoU is over 0.75. Results are shown in Table I. Our method significantly outperforms the frame-by-frame method on both metrics, showing that our method is capable of identifying false detections while keeping true ones. The robust front-end method achieves results similar to GeoFusion on mAP_{0.5} but underperforms on mAP_{0.75}, which shows that the back-end optimization improves the 6D poses of the objects and in turn increases the mAP under the tighter threshold. The baseline SLAM method does not take any geometric consistency into account and performs even worse than the frame-by-frame method: false and noisy measurements accumulate in the baseline SLAM with no way to remove them, which in turn shows that geometric consistency plays a vital role in semantic mapping at the object level.
VI-C2 Pose Accuracy
For 6D object pose accuracy, we use the ADD-S metric [xiang2017posecnn] to compute the average point distance between the estimated pose and the ground truth pose. Only true positives are evaluated at each frame for each method. As our aim is to build a high-accuracy object-level map, and the precision required for the robot to grasp objects is relatively high, we plot the accuracy-threshold curves within the range [0.00 m, 0.02 m], as shown in Fig. 5. Our method achieves the highest pose accuracy among all methods, and about 90% of the objects achieve a pose error under 1 cm. The pose accuracy of the baseline SLAM method is even worse than that of the frame-by-frame method, indicating that the confidence scores from the perception module are sometimes not reliable.
VI-C3 Geometric Relation
Fig. 7 compares the results of the two optimization stages for the scene in Fig. LABEL:fig:teaser. The left figure shows the result of optimization stage I without geometric relation constraints, and the right figure shows the result after optimizing over the geometric constraints. The geometric constraints pull objects into configurations that satisfy them: for example, the objects on the right are better aligned with the table, and the interpenetrations between objects are fixed.
VI-C4 Run Time
We show a run time analysis per frame for data association and optimization of the example scene in Fig. LABEL:fig:teaser. Note that our method runs only on the CPU (excluding measurement generation), and the average time per frame is under 200 ms.
VII Conclusion

In this work, we present GeoFusion, a SLAM-based scene understanding method for building a semantic map at the object level in dense clutter while taking geometric consistency into account. Our method infers object labels and 6D poses from noisy semantic measurements robustly and efficiently. Reasoning at the object level with geometry offers a fast and reliable way to filter out false positives and to constrain objects through geometric relations. The computed geometric relations are also directly amenable to high-level task planners for robots to reason over actions in goal-directed manipulation tasks.