Object-level data association and pose estimation play a fundamental role in semantic SLAM, yet they remain unsolved due to the lack of robust and accurate algorithms. In this work, we propose an ensemble data association strategy that integrates parametric and nonparametric statistical tests. By exploiting the nature of different statistics, our method can effectively aggregate the information of different measurements and thus significantly improve the robustness and accuracy of the association process. We then present an accurate object pose estimation framework, in which an outlier-robust centroid and scale estimation algorithm and an object pose initialization algorithm are developed to help improve the optimality of the estimated results. Furthermore, we build a SLAM system that can generate semi-dense or lightweight object-oriented maps with a monocular camera. Extensive experiments are conducted on three publicly available datasets and a real scenario. The results show that our approach significantly outperforms state-of-the-art techniques in accuracy and robustness.
Ensemble Data Association for Monocular Object SLAM
Conventional visual SLAM systems have achieved significant success in robot localization and mapping tasks. In recent years, more attention has been paid to enabling SLAM to serve robot navigation, object manipulation, environment understanding, and other high-level needs. Along with the development of object recognition techniques, semantic SLAM provides a promising solution for these applications.
Semantic SLAM leverages segmentation methods to label map elements and focuses on an overall expression of the environment. Typical approaches usually depend on RGB-D cameras and require a large amount of storage for map representation. As a specific branch of semantic SLAM, object SLAM aims to leverage the semantic information of objects to improve the robustness and accuracy of camera pose estimation, or the ability of semantic environment reconstruction, which exhibits significant advantages in many applications [5, 16, 20] and has drawn much attention from the community. In this work, we further extend the meaning of object SLAM by enabling it to build lightweight and object-oriented maps, as demonstrated in Fig. 1, in which the objects are represented by cubes or quadrics with their locations, poses, and scales accurately registered. The challenges of object SLAM are mainly twofold: 1) Existing data association methods [22, 11, 7] are not sufficiently robust or accurate to tackle complex environments with multiple object instances, and there is no practical solution that systematically addresses this problem; some studies even directly assume the problem is well solved. 2) Object pose estimation is not accurate, especially for monocular object SLAM. Although some improvements have been achieved in recent studies [25, 21, 17], a relatively complete observation of the object is typically required, which is difficult to obtain in real-world applications.
In this paper, we propose EAO-SLAM, a monocular object SLAM system that can effectively address the data association and pose estimation problems. First, leveraging the different measurements in SLAM, we integrate parametric and nonparametric statistical tests, as well as the traditional IoU-based method, to conduct model ensembling for data association. Compared with conventional methods, our approach sufficiently exploits the nature of different statistics and exhibits significant advantages: 1) The association is based on an ensemble of multiple models, which can tackle most cases robustly; 2) We additionally introduce a new testing method, the double-sample t-test, to assist the association process, which can effectively handle challenging cases and improve the overall association accuracy. In terms of object pose estimation, we propose a centroid and scale estimation algorithm and an object pose initialization algorithm, based on the isolation forest (iForest) and several proposed score functions, to improve the optimality of the estimation results. The algorithms are robust to outliers and achieve high accuracy, which significantly facilitates the joint optimization of object and camera poses.
The contributions of this paper are summarized as follows:
We propose an ensemble data association strategy that can effectively aggregate different measurements of the objects to improve association accuracy.
We propose an object pose estimation framework based on iForest, which is robust to outliers and can accurately estimate the locations, poses, and scales of objects.
Based on the proposed methods, we implement EAO-SLAM to build lightweight and object-oriented maps.
We conduct comprehensive experiments and verify the effectiveness of our proposed methods on three publicly available datasets. The source code is released at https://github.com/yanmin-wu/EAO-SLAM.
Data association is an indispensable ingredient for semantic SLAM, which is used to determine whether the object observed in the current frame corresponds to an existing object in the map. Bowman et al.  use a probabilistic method to model the data association process and leverage the EM algorithm to find correspondences between observed landmarks. Subsequent studies [22, 24] further extend the idea to associate dynamic objects or conduct semantic dense reconstruction. These methods can achieve high association accuracy, but can only process a limited number of object instances. Their efficiency also remains to be improved due to the expensive EM optimization process.
Object tracking is another commonly used approach for data association. For instance, the Kalman filter is used to associate vehicles by predicting their locations in consecutive frames [26, 3]. Li et al. project 3D cubes onto the image plane and then leverage the Hungarian tracking algorithm to conduct association using the projected 2D bounding boxes. Tracking-based methods offer high runtime efficiency, but can easily generate incorrect priors in complex environments, yielding incorrect association results.
In recent studies, more data association approaches have been developed based on maximum shared information. Liu et al. propose random walk descriptors to represent the topological relationships between objects, and objects with the maximum number of shared descriptors are regarded as the same instance. Instead, Yang et al. propose to directly count the number of matched map points on the detected objects as the association criterion, yielding much more efficient performance. Grinvald et al. propose to measure the similarity between semantic labels, and Ok et al. propose to leverage the correlation of hue saturation histograms. The major drawback of these methods is that the designed features or descriptors are typically not general or robust enough and can easily cause incorrect associations.
Weng et al. first propose nonparametric statistical testing for semantic data association, which can address problems where the statistics do not follow a Gaussian distribution. Later on, Iqbal et al. also verify the effectiveness of nonparametric data association. However, such a method cannot effectively handle statistics that do follow Gaussian distributions, and thus cannot sufficiently aggregate different measurements in SLAM. Based on this observation, we combine the parametric and nonparametric methods to perform model ensembling, which exhibits superior association performance in complex scenarios with multiple categories of objects.
The integration of objects significantly enlarges the application range of traditional SLAM. Some studies [15, 12, 8] treat objects as landmarks to estimate camera poses, or use objects for relocalization. Others leverage object size to constrain the scale of monocular SLAM, or remove dynamic objects to improve pose estimation accuracy [22, 18]. In recent years, the combination of object SLAM and grasping [14, 19] has also attracted much interest and facilitates research on autonomous mobile manipulation.
Object models in semantic SLAM can be broadly divided into three categories: instance-level models, category-specific models, and general models. Instance-level models [21, 2] depend on a well-established database that records all the related objects. The prior information of objects provides important object-camera constraints for graph optimization. Since the models need to be known in advance, the application scenarios of such methods are limited. There are also some studies on category-specific models, which focus on describing category-level features. For example, Parkhiya et al. and Joshi et al. use a CNN to estimate the viewpoint of an object and then project its 3D line segments onto the image plane to align them. The general model adopts simple geometric elements, e.g., cubes [25, 9], quadrics [15, 4], and cylinders, to represent objects, and is also the most commonly used model.
In terms of the joint optimization of camera and object poses, Frost et al. simply integrate object centroids as point clouds into the camera pose estimation process. Yang et al. propose a joint camera-object-point optimization scheme to construct pose and scale constraints for graph optimization. Nicholson et al. propose to project the quadric onto the image plane and then calculate the scale error between the projected 2D rectangle and the detected bounding box. This work also adopts the joint optimization strategy, but with a novel initialization method that can significantly improve the optimality of the solutions.
The proposed object SLAM framework is illustrated in Fig. 2. It is developed based on ORB-SLAM2 and additionally integrates a semantic thread that adopts YOLOv3 as the object detector. The ensemble data association is implemented in the tracking thread, which combines the information of bounding boxes, semantic labels, and point clouds. After that, the iForest is leveraged to eliminate outliers and help find an accurate initialization for the joint optimization. The object pose and scale are then optimized together with the camera pose to build a lightweight and object-oriented map. Lastly, this map is combined with the semi-dense map to obtain the final semi-dense semantic map.
Throughout this section, P1 and P2 denote the point clouds of two object observations, with sizes n1 and n2, and c denotes an object centroid.
The nonparametric test is leveraged to process the point cloud of an object (the green points in Fig. 3 (a)), which follows a non-Gaussian distribution according to our experimental studies (see Section VI-A). Theoretically, if two point clouds P1 and P2 belong to the same object, they should follow the same distribution, i.e., F1 = F2. We use the Wilcoxon rank-sum test to verify whether the hypothesis holds.
We first concatenate the two point clouds into W = P1 ∪ P2, and then sort W in each of the three dimensions respectively. Define the indicator r_i as follows: r_i = 1 if the i-th smallest element of W comes from P1, and r_i = 0 otherwise; the indicator for P2 is defined in the same way. The Mann-Whitney statistic is the rank sum W_s = Σ_i i·r_i, which can be proved to follow a Gaussian distribution asymptotically. Herein, we actually construct a Gaussian statistic from the non-Gaussian point clouds. Under the null hypothesis, the mean and variance of W_s are μ_W = n1(n1 + n2 + 1)/2 and σ²_W = n1·n2·(n1 + n2 + 1)/12, where n1 = |P1| and n2 = |P2|.
To make the hypothesis stand, W_s should meet the following constraint: −z_{α/2} ≤ (W_s − μ_W)/σ_W ≤ z_{α/2}, where α is the significance level, 1 − α is thus the confidence level, and [−z_{α/2}, z_{α/2}] defines the confidence region. The scalar z_{α/2} is defined on a normalized Gaussian distribution N(0, 1). In summary, if the Mann-Whitney statistic of two point clouds satisfies Eq. (4), they come from the same object and the data association succeeds.
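As a concrete illustration, the per-axis rank-sum check can be sketched with SciPy's implementation of the Wilcoxon rank-sum test. This is a minimal stand-in for Eq. (4): the function name and the per-axis acceptance rule are our own assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.stats import ranksums

def same_object_nonparametric(pc1, pc2, alpha=0.05):
    """Hypothetical helper: decide whether two (N, 3) point clouds could
    come from the same object via the Wilcoxon rank-sum test per axis.

    The test compares ranks rather than means, so it is valid even though
    the point clouds do not follow a Gaussian distribution.
    """
    for axis in range(3):
        _, p_value = ranksums(pc1[:, axis], pc2[:, axis])
        if p_value < alpha:
            # Null hypothesis (same distribution) rejected on this axis.
            return False
    return True  # no axis rejects H0: treat as the same object
```

In practice the significance level α trades off missed associations against false merges; the per-axis loop mirrors the paper's treatment of the three dimensions separately.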
Suppose the historical centroid measurements c_1, …, c_n and a new observation c_o are from the same object; define the statistic t as follows: t = (c̄ − c_o)/(s/√n), which follows a t-distribution with n − 1 degrees of freedom, where c̄ and s are the sample mean and standard deviation of the historical centroids. To make the hypothesis stand, t should satisfy |t| ≤ t_{α/2}(n − 1), where α is the significance level.
Due to the strict data association strategy above or bad viewing angles, some existing objects may be recognized as new ones. Hence, a double-sample t-test is leveraged to determine whether to merge two such objects by testing their historical object centroids (the stars in Fig. 3 (c)).
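A minimal sketch of this merging check, using SciPy's two-sample t-test on per-axis centroid histories. The helper name and the per-axis acceptance rule are assumptions for illustration, not the paper's exact formulation; the centroid errors are approximately Gaussian (see Section VI-A), which is what justifies a parametric test here.

```python
import numpy as np
from scipy.stats import ttest_ind

def should_merge(centroids_a, centroids_b, alpha=0.05):
    """Hypothetical helper: double-sample t-test on the centroid histories
    of two map objects, each an (N, 3) array of per-frame centroid estimates.

    If no axis rejects the null hypothesis of equal means, the two objects
    are considered duplicates and should be merged.
    """
    for axis in range(3):
        _, p = ttest_ind(centroids_a[:, axis], centroids_b[:, axis],
                         equal_var=True)  # classical pooled-variance t-test
        if p < alpha:
            return False  # centroids differ: keep the objects separate
    return True           # merge the two map objects
```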
Throughout this section, the following notations are used:
t_wo - the translation (location) of the object frame in the world frame.
θ - the rotation (yaw) of the object frame w.r.t. the world frame; R(θ) is its matrix representation.
T_wo - the transformation of the object frame w.r.t. the world frame, composed of R(θ) and t_wo.
s - half of the side lengths of a 3D bounding box, i.e., the scale of an object.
P_o, P_w - the coordinates of the eight vertices of a cube in the object and world frame, respectively.
Q_o, Q_w - the quadric parameterized by its semiaxes in the object and world frame, respectively, where the semiaxes are given by s.
∠(l1, l2) - the angle between line segments l1 and l2.
K, T_cw - the intrinsic and extrinsic parameters of the camera.
p_w - the coordinates of a point in the world frame.
Object Representation: In this work, we leverage cubes and quadrics to represent objects, rather than complex instance-level or category-specific models. For objects with regular shapes, such as books, keyboards, and chairs, we use cubes (encoded by their vertices P_o) to represent them. For objects without an explicit direction, such as balls, bottles, and cups, the quadric (encoded by its semiaxes) is used for representation. Here, both P_o and Q_o are expressed in the object frame and only depend on the scale parameter s. To register these elements to the global map, we also need to estimate their translation t_wo and orientation θ w.r.t. the global frame. The cubes and quadrics in the global frame are obtained by transforming the object-frame representations with T_wo, i.e., P_w = R(θ)P_o + t_wo for each cube vertex and Q_w = T_wo Q_o T_woᵀ for the (dual) quadric, respectively.
With the assumption that the objects are placed parallel to the ground, i.e., roll and pitch are zero, we only need to estimate the translation t_wo, the scale s, and the yaw angle θ for a cube, and t_wo and s for a quadric.
Estimate t_wo and s: Given the point cloud P of an object in the global frame, we follow conventions and take its mean value as t_wo, based on which the scale s can be calculated as half the extent of P along each axis. The main challenge here is that P typically contains many outliers, which can introduce a large bias into the estimation of t_wo and s. One of our major contributions in this paper is an outlier-robust centroid and scale estimation algorithm based on the iForest to improve the estimation accuracy. The detailed procedure of our algorithm is presented in Alg. 1.
The key idea of the algorithm is to recursively partition the data space until points become isolated, and to take the easily isolated ones as outliers. The philosophy is that normal points are typically located close together and thus need more steps to isolate, while outliers usually scatter sparsely and can be isolated in fewer steps. As indicated by the algorithm, we first create the isolation trees (the iForest) using the point cloud of an object (lines 2 and 14-33), and then identify the outliers by scoring the path length of each point (lines 3-9), in which the score function is defined as follows:
where the score decreases with the path length of a point, normalized by a normalization parameter and scaled by a weight coefficient. As demonstrated in Fig. 4(d)-(e), the yellow point is isolated after four steps, so its path length is 4, while the green point has a path length of 8. Therefore, the yellow point is more likely to be an outlier. In our implementation, points with a score greater than 0.6 are removed, and the remaining points are used to calculate t_wo and s (lines 10-12). Based on s, we can initially construct the cubes and quadrics in the object frame, as shown in Fig. 4(a)-(c). These will be further optimized along with the object and camera poses later on.
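A compact sketch of the centroid and scale step, substituting scikit-learn's IsolationForest for the paper's own iForest implementation (Alg. 1). The contamination-based cutoff used here replaces the paper's fixed 0.6 score threshold and is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def estimate_centroid_and_scale(points, contamination=0.1, seed=0):
    """Outlier-robust centroid t and half-extent scale s of an (N, 3)
    object point cloud, in the spirit of Alg. 1.

    Easily isolated points (short average path length in the random trees)
    are treated as outliers and discarded before estimating t and s.
    """
    forest = IsolationForest(contamination=contamination, random_state=seed)
    inlier_mask = forest.fit_predict(points) == 1   # +1 inlier, -1 outlier
    inliers = points[inlier_mask]
    t = inliers.mean(axis=0)                        # centroid (translation)
    s = (inliers.max(axis=0) - inliers.min(axis=0)) / 2.0  # half side lengths
    return t, s, inlier_mask
```

Without the outlier filter, even a handful of stray map points far from the object would drag the centroid away and inflate the scale; the filter keeps both estimates close to the dense cluster.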
Estimate θ: The estimation of θ is divided into two steps: first find a good initial value for θ, and then conduct numerical optimization based on that initial value. Since pose estimation is a non-linear process, a good initialization is very important for improving the optimality of the estimation result. Conventional methods [9, 20] usually neglect the initialization process, which typically yields inaccurate results.
The details of the pose initialization algorithm are presented in Alg. 2. The inputs are obtained as follows: 1) LSD segments are extracted from consecutive images, and those falling within the bounding boxes are assigned to the corresponding objects (see Fig. 5a); 2) The initial pose of an object is assumed to be aligned with the global frame (see Fig. 5b). In the algorithm, we first uniformly sample thirty candidate yaw angles (line 2). For each sample, we then evaluate its score by calculating the accumulated angle errors between the LSD segments and the projected edges of the cube (lines 3-12). The error is defined as follows:
A demonstration of the error calculation is visualized in Fig. 5 (e)-(g). The score function is defined as follows:
where the score combines the total number of line segments of the object in the current frame, the number of line segments whose angle error falls below a manually defined threshold (five degrees here), and the average error of those segments. After evaluating all the samples, we choose the one that achieves the highest score as the initial yaw angle for the subsequent optimization process (line 13).
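The sampling-and-scoring loop of Alg. 2 can be sketched as follows. Note this is a deliberately simplified stand-in: the paper compares LSD segments against the image-plane projections of cube edges, whereas here segments are reduced to undirected image-plane angles (a top-down assumption) and the score formula is our own combination of the three quantities named above, not the paper's exact score function.

```python
import numpy as np

def angle_error(a, b):
    """Smallest difference between two undirected line angles, in degrees.
    Cube edges come in 90-degree pairs, hence the modulo-90 wrap."""
    d = abs(a - b) % 90.0
    return min(d, 90.0 - d)

def init_yaw(segment_angles, n_samples=30, threshold_deg=5.0):
    """Sample candidate yaw angles in [-45, 45) degrees and score each one
    by how well the cube edge directions align with the detected LSD
    segment angles -- a simplified stand-in for the scoring in Alg. 2."""
    best_yaw, best_score = 0.0, -np.inf
    for yaw in np.linspace(-45.0, 45.0, n_samples, endpoint=False):
        errors = np.array([angle_error(a, yaw) for a in segment_angles])
        aligned = errors < threshold_deg
        if aligned.sum() == 0:
            continue
        # more aligned segments and a smaller mean error give a higher score
        score = aligned.sum() / len(errors) / (1.0 + errors[aligned].mean())
        if score > best_score:
            best_yaw, best_score = yaw, score
    return best_yaw
```

The exhaustive sweep over a coarse grid sidesteps the non-convexity of the angular error; the winning sample then seeds the continuous optimization.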
Joint Optimization: After obtaining the initial θ, t_wo, and s, we jointly optimize the object and camera poses as follows:
where the first term combines the object pose error defined in Eq. (13) and the scale error, defined as the distance between the projected edges of a cube and their nearest parallel LSD segments. The second term is the commonly used reprojection error in the traditional SLAM framework.
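As a toy illustration of the object part of this joint optimization, the sketch below refines a cube's translation and yaw with a non-linear least-squares solver. It is a deliberate simplification: camera poses are held fixed and the residuals are 3D vertex errors rather than the paper's image-plane pose, scale, and reprojection terms; all function names are our own.

```python
import numpy as np
from scipy.optimize import least_squares

def cube_vertices(t, yaw, s):
    """Eight vertices of a cube with half side lengths s, rotated by yaw
    about the vertical axis and translated by t (world frame)."""
    c, si = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -si, 0.0], [si, c, 0.0], [0.0, 0.0, 1.0]])
    corners = np.array([[x, y, z] for x in (-1, 1)
                                  for y in (-1, 1)
                                  for z in (-1, 1)], dtype=float)
    return (R @ (corners * s).T).T + t

def refine_object(observed, s, t0, yaw0):
    """Refine translation and yaw of a cube (scale s fixed) against observed
    vertex positions, starting from the initial values t0 and yaw0."""
    def residual(x):
        t, yaw = x[:3], x[3]
        return (cube_vertices(t, yaw, s) - observed).ravel()
    sol = least_squares(residual, np.r_[t0, yaw0])
    return sol.x[:3], sol.x[3]
```

Because the objective is non-convex in yaw, the solver only converges to the basin containing its starting point, which is exactly why the initialization of Alg. 2 matters.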
For data association, the measurements adopted for the statistical tests include the point clouds of an object and their centroids. To validate our hypothesis about the distributions of the different measurements, we collect a large amount of data and visualize the distributions in Fig. 6.
Fig. 6 (a) shows the distributions of the point clouds of 13 objects during data association in the fr3_long_office sequence, which obviously do not follow a Gaussian distribution. It can be seen that the distributions are related to the characteristics of the objects and do not show consistent behaviors. Fig. 6 (b) shows the error distribution of object centroids in different frames, which typically follows a Gaussian distribution. This result justifies applying the nonparametric Wilcoxon rank-sum test to point clouds and the t-test to object centroids.
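This distribution check can be reproduced on synthetic stand-ins for the two measurement types with a normality test. The data models below are assumptions for illustration only: object surface points are approximated as uniform over an extent, and centroid estimates as a true centroid plus small Gaussian noise.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Stand-in for point-cloud coordinates: roughly uniform over the object
# extent along one axis -> expected to be non-Gaussian.
surface_x = rng.uniform(-0.3, 0.3, 2000)

# Stand-in for per-frame centroid estimates: true value plus small noise
# -> expected to be Gaussian.
centroid_x = 1.0 + rng.normal(0.0, 0.01, 2000)

_, p_surface = normaltest(surface_x)    # D'Agostino-Pearson normality test
_, p_centroid = normaltest(centroid_x)
# Expect normality to be strongly rejected for the surface samples but not
# for the noisy centroids, mirroring Fig. 6 (a) vs. (b).
```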
We compare our method with the commonly used IoU method, the nonparametric test (NP), and the t-test. Fig. 7 shows the association results of these methods on the fr3_long_office sequence. It can be seen that some objects are not correctly associated in (a)-(c). Due to the lack of association information, existing objects are often misrecognized as new ones by these methods once the objects are occluded or disappear in some frames. In contrast, our method is much more robust and can effectively address this problem (see Fig. 7(d)). The results on other sequences are shown in Table I, where we use the same evaluation metric as previous work, which measures the number of objects that finally appear in the map; GT represents the ground-truth number. As we can see, our method achieves a high association success rate, and thus fewer spurious objects appear in the map, which clearly demonstrates its effectiveness.
We also compare our method with the state of the art, and the results are shown in Table II. As indicated, our method significantly outperforms the baseline. Especially on the TUM dataset, the number of successfully associated objects by our method is almost twice that of the baseline. On Microsoft RGBD and Scenes V2, the advantage is less obvious since the number of objects there is limited. The reasons for the inaccurate association of the baseline are twofold: 1) A clustering algorithm is leveraged to tackle the problem mentioned above, which removes most of the candidate objects; 2) The method does not exploit different statistics, so the candidates are not accurately associated.
We superimpose the cubes and quadrics of objects on semi-dense maps for qualitative evaluation. Fig. 8 presents the pose estimation results for objects in 14 sequences of the three datasets, in which the objects are placed randomly and in different directions. As shown, the proposed method achieves promising results with a monocular camera, which demonstrates the effectiveness of our pose estimation algorithm. Since the datasets are not specially designed for object pose estimation, there is no ground truth for quantitative evaluation. Here, we compare the results before initialization (BI), after initialization (AI), and after joint optimization (JO). As shown in Table III, the original direction of an object is parallel to the global frame and has a large angle error. After pose initialization, the error is decreased, and after the joint optimization, the error is further reduced, which verifies the effectiveness of our pose estimation algorithm.
Lastly, we build object-oriented semantic maps based on the robust data association algorithm, the accurate object pose estimation algorithm, and a semi-dense mapping system. Fig. 9 shows two examples from TUM fr3_long_office and fr2_desk, where (d) and (e) show a semi-dense semantic map and an object-oriented map built by EAO-SLAM. Compared with the sparse map of ORB-SLAM2, our maps express the environment much better. Moreover, the object-oriented map shows superior performance in environment understanding compared with the semi-dense map alone.
The mapping results of other sequences in the TUM, Microsoft RGB-D, and Scenes V2 datasets are shown in Fig. 10. It can be seen that EAO-SLAM can process multiple classes of objects with different scales and orientations in complex environments. Inevitably, there are some inaccurate estimations. For instance, in the fire sequence, the chair is too large to be well observed by the fast-moving camera, yielding an inaccurate estimation. We also conduct experiments in a real scenario (Fig. 11). It can be seen that even when objects are occluded, they can be accurately estimated, which further verifies the robustness and accuracy of our system.
In this paper, we present the EAO-SLAM system, which aims to build semi-dense or lightweight object-oriented maps. The system is implemented based on a robust ensemble data association method and an accurate pose estimation framework. Extensive experiments show that our proposed algorithms and SLAM system can build accurate object-oriented maps with object poses and scales accurately registered. The methodologies presented in this work further push the limits of semantic SLAM and will facilitate related research on robot navigation, mobile manipulation, and human-robot interaction.
Localization of classified objects in SLAM using nonparametric statistics and clustering. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 161–168.
Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), pp. 1–39.
ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.