Object-level data association and pose estimation play a fundamental role in semantic SLAM, but remain unsolved due to the lack of robust and accurate algorithms. In this work, we propose an ensemble data association strategy that integrates parametric and nonparametric statistical tests. By exploiting the nature of different statistics, our method can effectively aggregate the information of different measurements, and thus significantly improve the robustness and accuracy of the association process. We then present an accurate object pose estimation framework, in which an outlier-robust centroid and scale estimation algorithm and an object pose initialization algorithm are developed to help improve the optimality of the estimated results. Furthermore, we build a SLAM system that can generate semi-dense or lightweight object-oriented maps with a monocular camera. Extensive experiments are conducted on three publicly available datasets and a real scenario. The results show that our approach significantly outperforms state-of-the-art techniques in accuracy and robustness.
Ensemble Data Association for Monocular Object SLAM
Conventional visual SLAM systems have achieved significant success in robot localization and mapping tasks. In recent years, more attention has been paid to enabling SLAM to serve robot navigation, object manipulation, environment understanding, and other high-level needs. Along with the development of object recognition techniques, semantic SLAM provides a promising solution for these applications.
Semantic SLAM leverages segmentation methods to label map elements and focuses on an overall expression of the environment. Typical approaches depend on RGB-D cameras and require a large amount of storage for map representation. As a specific branch of semantic SLAM, object SLAM aims to leverage the semantic information of objects to improve the robustness and accuracy of camera pose estimation, or the ability of semantic environment reconstruction; it exhibits significant advantages in many applications [5, 16, 20] and has drawn much attention from the community. In this work, we further extend the meaning of object SLAM by enabling it to build lightweight and object-oriented maps, as demonstrated in Fig. 1, in which the objects are represented by cubes or quadrics with their locations, poses, and scales accurately registered. The challenges of object SLAM are twofold: 1) Existing data association methods [22, 11, 7] are not robust or accurate enough to handle complex environments with multiple object instances, and no practical solution systematically addresses this problem. For instance, some studies, e.g., [15], directly assume the problem is well solved. 2) Object pose estimation is not accurate, especially for monocular object SLAM. Although some improvements have been achieved in recent studies [25, 21, 17], a relatively complete observation of the object is typically required, which is difficult to achieve in real-world applications.
In this paper, we propose EAO-SLAM, a monocular object SLAM system that can effectively address the data association and pose estimation problems. First, leveraging the different measurements in SLAM, we integrate parametric and nonparametric statistical tests, as well as the traditional IoU-based method, to conduct model ensembling for data association. Compared with conventional methods, our approach sufficiently exploits the nature of different statistics and exhibits significant advantages: 1) The association is based on an ensemble of multiple models, which can handle most cases robustly; 2) We additionally introduce a new testing method, called the double-sample t-test, to assist the association process, which can effectively handle challenging cases and help improve the overall association accuracy. In terms of object pose estimation, we propose a centroid and scale estimation algorithm and an object pose initialization algorithm, based on the isolation forest (iForest) and several proposed score functions, to improve the optimality of the estimation results. The algorithms are robust to outliers and achieve high accuracy, which significantly facilitates the joint optimization of object and camera poses.
The contributions of this paper are summarized as follows:
We propose an ensemble data association strategy that can effectively aggregate different measurements of the objects to improve association accuracy.
We propose an object pose estimation framework based on iForest, which is robust to outliers and can accurately estimate the locations, poses, and scales of objects.
Based on the proposed method, we implement EAO-SLAM to build lightweight and object-oriented maps.
We conduct comprehensive experiments and verify the effectiveness of our proposed methods on three publicly available datasets. The source code is also released (https://github.com/yanminwu/EAOSLAM).
Data association is an indispensable ingredient for semantic SLAM, which is used to determine whether the object observed in the current frame corresponds to an existing object in the map. Bowman et al. [1] use a probabilistic method to model the data association process and leverage the EM algorithm to find correspondences between observed landmarks. Subsequent studies [22, 24] further extend the idea to associate dynamic objects or conduct semantic dense reconstruction. These methods can achieve high association accuracy, but can only process a limited number of object instances. Their efficiency also remains to be improved due to the expensive EM optimization process.
Object tracking is another commonly used approach in data association. For instance, the Kalman filter is used to associate vehicles by predicting their locations in consecutive frames [26, 3]. Li et al. [9] project 3D cubes onto the image plane, and then leverage the Hungarian tracking algorithm to conduct association using the projected 2D bounding boxes. Tracking-based methods have high runtime efficiency, but can easily generate incorrect priors in complex environments, yielding incorrect association results.
In recent studies, more data association approaches have been developed based on maximum shared information. Liu et al. [11] propose random walk descriptors to represent the topological relationships between objects, and those with the maximum number of shared descriptors are regarded as the same instance. Instead, Yang et al. [25] propose to directly count the number of matched map points on the detected objects as the association criterion, yielding much more efficient performance. Grinvald et al. [5] propose to measure the similarity between semantic labels, and Ok et al. [16] propose to leverage the correlation of hue-saturation histograms. The major drawback of these methods is that the designed features or descriptors are typically not general or robust enough and can easily cause incorrect associations.
Weng et al. [12] are the first to propose nonparametric statistical testing for semantic data association, which can address problems where the statistics do not follow a Gaussian distribution. Later on, Iqbal et al. [7] also verify the effectiveness of nonparametric data association. However, such a method cannot effectively address statistics that follow Gaussian distributions, and thus cannot sufficiently aggregate different measurements in SLAM. Based on this observation, we combine the parametric and nonparametric methods to perform model ensembling, which exhibits superior association performance in complex scenarios with multiple categories of objects.
The integration of objects significantly enlarges the application range of traditional SLAM. Some studies [15, 12, 8] treat objects as landmarks to estimate camera poses, or use objects for relocalization [9]. Some studies [3] leverage object size to constrain the scale of monocular SLAM, or remove dynamic objects to improve pose estimation accuracy [22, 18]. In recent years, the combination of object SLAM and grasping [14, 19] has also attracted much interest and facilitates research on autonomous mobile manipulation.
Object models in semantic SLAM can be broadly divided into three categories: instance-level models, category-specific models, and general models. Instance-level models [21, 2] depend on a well-established database that records all the related objects. The prior information of objects provides important object-camera constraints for graph optimization. Since the models need to be known in advance, the application scenarios of such methods are limited. There are also some studies on category-specific models, which focus on describing category-level features. For example, Parkhiya et al. [17] and Joshi et al. [8] use a CNN to estimate the viewpoint of the object and then project the 3D line segments onto the image plane to align them. The general model adopts simple geometric elements, e.g., cubes [25, 9], quadrics [15, 4], and cylinders [17], to represent objects, and is also the most commonly used model.
In terms of the joint optimization of camera and object poses, Frost et al. [3] simply integrate object centroids as point clouds into the camera pose estimation process. Yang et al. [25] propose a joint camera-object-point optimization scheme to construct the pose and scale constraints for graph optimization. Nicholson et al. [15] propose to project the quadric onto the image plane, and then calculate the scale error between the projected 2D rectangle and the detected bounding box. This work also adopts the joint optimization strategy, but with a novel initialization method, which can significantly improve the optimality of the solutions.
The proposed object SLAM framework is shown in Fig. 2. It is developed based on ORB-SLAM2 [13] and additionally integrates a semantic thread that adopts YOLOv3 as the object detector. The ensemble data association is implemented in the tracking thread, which combines the information of bounding boxes, semantic labels, and point clouds. After that, the iForest is used to eliminate outliers and help find an accurate initialization for the joint optimization. The object pose and scale are then optimized together with the camera pose to build a lightweight and object-oriented map. Lastly, the map is combined with the semi-dense map to obtain the final semi-dense semantic map.
Throughout this section, the following notations are used:
$P$, $Q$ - the point clouds of objects.
$\mathcal{R}$ - the rank (position) of a data point in a sorted list.
$c$ - the currently observed object centroid.
$C = [c_1, c_2, \dots, c_k]$ - the history observations of the centroids of an object.
$f(\cdot)$ - the probability function used for statistic test.
$m(\cdot)$, $\sigma(\cdot)$ - the mean and variance functions.
The nonparametric test is used to process the point cloud of an object (the green points in Fig. 3 (a)), which follows a non-Gaussian distribution according to our experimental studies (see Section VI-A). Theoretically, if two point clouds $P$ and $Q$ belong to the same object, they should follow the same distribution, i.e., $f_P = f_Q$. We use the Wilcoxon Rank-Sum test [23] to verify whether this hypothesis holds.
We first concatenate the two point clouds, $X = [P \mid Q] = [x_1, x_2, \dots, x_{|X|}]$, and then sort $X$ in each of the three dimensions. Define $W_P \in \mathbb{R}^{3 \times 1}$ as follows,
$$W_P = \sum_{k=1}^{|X|} \mathcal{R}(x_k)\,\mathbb{1}\{x_k \in P\} - \frac{|P|(|P|+1)}{2}, \tag{1}$$
and define $W_Q$ in the same way. The Mann-Whitney statistic is $W = \min(W_P, W_Q)$, which can be proved to follow a Gaussian distribution asymptotically [23]. In effect, we construct a Gaussian statistic from the non-Gaussian point clouds. The mean and variance of $W$ on the underlying distribution can be calculated as follows:
$$m(W) = \frac{|P||Q|}{2}, \tag{2}$$
$$\sigma(W) = \frac{|P||Q|\Delta}{12} - \frac{|P||Q|\sum_i (\tau_i^3 - \tau_i)}{12(|P|+|Q|)(|P|+|Q|-1)}, \tag{3}$$
where $\Delta = |P| + |Q| + 1$, and $\tau_i$ is the number of points sharing the $i$-th tied rank.
To make the hypothesis stand, $W$ should meet the following constraint:
$$f(W) \geq f(r_r) = f(r_l) = \alpha/2, \tag{4}$$
where $\alpha$ is the significance level, $1 - \alpha$ is thus the confidence level, and $[r_l, r_r] \approx [m - s\sqrt{\sigma},\, m + s\sqrt{\sigma}]$ defines the confidence region. The scalar $s > 0$ is determined by $\alpha$ on the normalized Gaussian distribution $\mathcal{N}(0, 1)$. In summary, if the Mann-Whitney statistic of two point clouds satisfies Eq. (4), they come from the same object and the data association succeeds.
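The rank-sum check described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function names, the default `alpha`, the use of the textbook tie-corrected variance, and the rule that all three coordinate axes must pass are assumptions made here.

```python
import math
from statistics import NormalDist


def rank_sum_accepts(p, q, alpha=0.05):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney) test on one coordinate axis.

    Returns True when the hypothesis "p and q share one distribution"
    is NOT rejected at significance level alpha (normal approximation).
    """
    n_p, n_q = len(p), len(q)
    n = n_p + n_q
    pooled = sorted([(v, 0) for v in p] + [(v, 1) for v in q])

    rank_sum_p = 0.0   # sum of the ranks held by samples of p
    tie_term = 0.0     # sum over tie groups of (t^3 - t)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_p += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    w_p = rank_sum_p - n_p * (n_p + 1) / 2.0
    w_q = n_p * n_q - w_p
    w = min(w_p, w_q)                          # Mann-Whitney statistic

    mean = n_p * n_q / 2.0
    var = n_p * n_q / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0.0:                             # every value identical
        return True
    s = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(w - mean) <= s * math.sqrt(var)


def same_object(cloud_a, cloud_b, alpha=0.05):
    """Assumed decision rule: the test must pass on x, y and z separately."""
    return all(
        rank_sum_accepts([pt[d] for pt in cloud_a], [pt[d] for pt in cloud_b], alpha)
        for d in range(3)
    )
```

Calling `same_object(cloud_a, cloud_b)` with two lists of `(x, y, z)` tuples returns a boolean association decision; a smaller `alpha` makes the test less likely to reject an association.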
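The single-sample t-test is used to process object centroids observed in different frames (the stars in Fig. 3 (b)), which typically follow a Gaussian distribution (see Section VI-A).
Suppose the currently observed centroid $c$ and the history centroids $C = [c_1, \dots, c_k]$ are from the same object, and define the $t$ statistic as follows,
$$t = \frac{m(C) - c}{\sqrt{\sigma(C)/k}} \sim t(k-1). \tag{5}$$
To make the hypothesis stand, $t$ should satisfy:
$$f(t) \geq f(t_{\alpha/2, v}) = \alpha/2, \tag{6}$$
where $t_{\alpha/2, v}$ is the upper $\alpha/2$ quantile of the t-distribution with $v$ degrees of freedom, and $v = k - 1$. If the statistic $t$ satisfies (6), $c$ and $C$ come from the same object.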
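A minimal sketch of the single-sample check on one coordinate axis follows. The function names are hypothetical, and the critical value `t_crit` is passed in by the caller (for example 2.776 for a two-sided test at the 0.05 level with 4 degrees of freedom) because the Python standard library does not provide t-distribution quantiles.

```python
import math
from statistics import mean, stdev


def t_statistic(history, current):
    """One-sample t statistic of a new observation against past observations."""
    k = len(history)
    sd = stdev(history)                 # sample standard deviation (k - 1 divisor)
    if sd == 0.0:
        raise ValueError("history needs at least two distinct observations")
    return (mean(history) - current) / (sd / math.sqrt(k))


def centroid_matches(history, current, t_crit):
    """Accept the association when |t| stays inside the critical value."""
    return abs(t_statistic(history, current)) <= t_crit
```

In a 3D setting the same check would be applied per axis; how the three per-axis results are combined is a design choice that this sketch leaves open.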
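Due to the strict data association strategy above or poor viewing angles, some existing objects may be recognized as new ones. Hence, a double-sample t-test is used to determine whether to merge two objects by testing their history centroids (the stars in Fig. 3 (c)).
Construct the $t$ statistic for the history centroids $C_1$ and $C_2$ of two objects, with $k_1$ and $k_2$ observations respectively, as follows,
$$t = \frac{m(C_1) - m(C_2)}{\sigma_d} \sim t(k_1 + k_2 - 2), \tag{7}$$
$$\sigma_d = \sqrt{\frac{(k_1 - 1)\,\sigma(C_1) + (k_2 - 1)\,\sigma(C_2)}{k_1 + k_2 - 2}\left(\frac{1}{k_1} + \frac{1}{k_2}\right)}, \tag{8}$$
where $\sigma_d$ is the pooled standard deviation of the two objects. Similarly, if $t$ satisfies (6) with $v = k_1 + k_2 - 2$, it means that $C_1$ and $C_2$ belong to the same object, and we merge them.
Throughout this section, the following notations are used: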
 the translation (location) of object frame in world frame.
 the rotation of object frame w.r.t. world frame. is matrix representation.
 the transformation of object frame w.r.t. world frame.
 half of the side length of a 3D bounding box, i.e., the scale of an object.
 the coordinates of eight vertices of a cube in object and world frame, respectively.
 the quadric parameterized by its semiaxis in object and world frame, respectively, where .
 calculate the angle of line segments .
 the intrinsic and extrinsic parameters of camera.
 the coordinates of a point in world frame.
Object Representation: In this work, we use cubes and quadrics to represent objects, rather than complex instance-level or category-level models. For objects with regular shapes, such as books, keyboards, and chairs, we use cubes (encoded by its vertices ) to represent them. For non-regular objects without an explicit direction, such as balls, bottles, and cups, the quadric (encoded by its semiaxis ) is used for representation. Here, and are expressed in the object frame and only depend on the scale parameter . To register these elements to the global map, we also need to estimate their translation and orientation w.r.t. the global frame. The cubes and quadrics in the global frame are expressed as follows, respectively:
(9) 
(10) 
With the assumption that the objects are placed parallel with the ground, i.e., , we only need to estimate for a cube and for a quadric.
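As an illustration of registering a cube in the global frame under the ground-parallel assumption, the sketch below builds the eight vertices from the half side lengths, rotates them by a yaw angle only, and translates them. The function name and the convention that the world z-axis is the ground normal are assumptions made here.

```python
import math


def cube_vertices_world(half_sizes, yaw, translation):
    """Eight cube vertices in the world frame.

    half_sizes  -- (sx, sy, sz), half of the side lengths (the object scale)
    yaw         -- rotation about the world z-axis in radians (roll = pitch = 0)
    translation -- (tx, ty, tz), object centre in the world frame
    """
    sx, sy, sz = half_sizes
    tx, ty, tz = translation
    c, s = math.cos(yaw), math.sin(yaw)
    vertices = []
    for dx in (-1, 1):
        for dy in (-1, 1):
            for dz in (-1, 1):
                x, y, z = dx * sx, dy * sy, dz * sz      # vertex in object frame
                vertices.append((c * x - s * y + tx,      # rotate about z, then shift
                                 s * x + c * y + ty,
                                 z + tz))
    return vertices
```

With roll and pitch fixed to zero, a cube is fully described by seven numbers: three for translation, three for scale, and one yaw angle.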
Estimate translation and scale: Given a point cloud of an object in the global frame, we follow convention and take its mean value as the translation, based on which the scale can be calculated. The main challenge here is that the point cloud typically contains many outliers, which can introduce a large bias to the estimation of translation and scale. One of our major contributions in this paper is the development of an outlier-robust centroid and scale estimation algorithm based on the iForest [10] to improve the estimation accuracy. The detailed procedure of our algorithm is presented in Alg. 1.
The key idea of the algorithm is to recursively separate the data space into a series of isolated data points, and take the easily isolated ones as outliers. The rationale is that normal points are typically located close together and thus need more steps to isolate, while outliers are usually scattered sparsely and can be isolated in fewer steps. As indicated by the algorithm, we first create isolated trees (the iForest) using the point cloud of an object (lines 2 and 14-33), and then identify the outliers by counting the path length of each point (lines 3-9), in which the score function is defined as follows:
(11) 
(12) 
where is a normalization parameter and is a weight coefficient. As demonstrated in Fig. 4(d)-(e), the yellow point is isolated after four steps, thus its path length is 4, and the green point has a path length of 8. Therefore, the yellow point is more likely to be an outlier. In our implementation, points with a score greater than 0.6 are removed, and the remaining points are used to calculate the translation and scale (lines 10-12). Based on the scale, we can initially construct the cubes and quadrics in the object frame, as shown in Fig. 4(a)-(c). The scale will be further optimized along with the object and camera poses later on.
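The outlier-removal idea can be sketched with a small pure-Python isolation forest. This is not Alg. 1 itself: it uses the standard iForest score from [10], $2^{-E[h(x)]/c(n)}$ with $c(n) = 2H(n-1) - 2(n-1)/n$, and the tree count, subsample size, and the scale rule (half of the inlier extent per axis) are assumptions made for illustration. Only the 0.6 threshold is taken from the text.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random axis at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                              # cannot split on this axis
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(x, node, depth=0):
    """Depth at which x lands, plus the expected depth of the unsplit leaf."""
    if node[0] == "leaf":
        return depth + c_factor(node[1])
    _, dim, split, left, right = node
    return path_length(x, left if x[dim] < split else right, depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Scores in (0, 1]; values near 1 mean the point is easy to isolate."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    psi = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(points, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-(sum(path_length(p, t) for t in trees) / n_trees) / norm)
            for p in points]


def robust_centroid_and_scale(points, threshold=0.6, **kwargs):
    """Drop points scoring above the threshold, then estimate centroid and scale."""
    scores = anomaly_scores(points, **kwargs)
    inliers = [p for p, s in zip(points, scores) if s <= threshold]
    dims = range(len(points[0]))
    centroid = tuple(sum(p[d] for p in inliers) / len(inliers) for d in dims)
    # Assumed scale rule: half of the inlier extent along each axis.
    scale = tuple((max(p[d] for p in inliers) - min(p[d] for p in inliers)) / 2.0
                  for d in dims)
    return centroid, scale, inliers
```

`points` is expected to be a list of equal-length coordinate tuples. Because the threshold is fixed, points on the boundary of a compact cluster may also be discarded, which slightly shrinks the estimated scale.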
Estimate yaw angle: The estimation of the yaw angle is divided into two steps: first find a good initial value, and then conduct numerical optimization based on that initial value. Since pose estimation is a nonlinear process, a good initialization is very important to help improve the optimality of the estimation result. Conventional methods [9, 20] usually neglect the initialization process, which typically yields inaccurate results.
The details of the pose initialization algorithm are presented in Alg. 2. The inputs are obtained as follows: 1) LSD segments are extracted from consecutive images and those falling in the bounding boxes are assigned to the corresponding objects (see Fig. 5a); 2) The initial pose of an object is assumed to be consistent with the global frame, i.e., (see Fig. 5b). In the algorithm, we first uniformly sample thirty angles within (line 2). For each sample, we then evaluate its score by calculating the accumulated angle errors between LSD segments and the projected edges of the cube (lines 3-12). The error is defined as follows:
(13)  
The calculation of the error is illustrated in Fig. 5 (e)-(g). The score function is defined as follows:
(14) 
where is the total number of line segments of the object in the current frame, is the number of line segments that satisfy , is a manually defined error threshold (five degrees here), and is the average error of these line segments with . After evaluating all the samples, we choose the one that achieves the highest score as the initial yaw angle for the following optimization process (line 13).
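The sample-and-score initialization can be sketched as below. Several pieces are assumptions made for illustration rather than Alg. 2 itself: the sampling range defaults to $[-\pi/2, \pi/2)$, `project_edge_angles` is a hypothetical callback that returns the image-plane angles of the projected cube edges for a candidate yaw, and the score (inlier ratio multiplied by one minus the normalized mean inlier error) is a stand-in for the score function, which may differ. The thirty samples and the five-degree threshold come from the text.

```python
import math


def angle_diff(a, b):
    """Smallest difference between two undirected line angles, in radians."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def score_yaw(segment_angles, edge_angles, threshold):
    """Assumed score: share of well-aligned segments, weighted by how well they align."""
    errors = [min(angle_diff(s, e) for e in edge_angles) for s in segment_angles]
    good = [e for e in errors if e < threshold]
    if not good:
        return 0.0
    mean_err = sum(good) / len(good)
    return (len(good) / len(errors)) * (1.0 - mean_err / threshold)


def init_yaw(segment_angles, project_edge_angles, n_samples=30,
             lo=-math.pi / 2, hi=math.pi / 2, threshold=math.radians(5)):
    """Return (best_yaw, best_score) over uniformly sampled yaw candidates."""
    best_yaw, best_score = None, -1.0
    for i in range(n_samples):
        yaw = lo + (hi - lo) * i / n_samples
        sc = score_yaw(segment_angles, project_edge_angles(yaw), threshold)
        if sc > best_score:
            best_yaw, best_score = yaw, sc
    return best_yaw, best_score
```

The returned yaw is only a starting point; it is meant to be refined by the subsequent numerical optimization.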
Joint Optimization: After obtaining the initial and , we jointly optimize object and the camera poses as follows:
(15) 
where the first term is the object pose error defined in Eq. (13) and the scale error defined as the distance between the projected edges of a cube and their nearest parallel LSD segments. The second term is the commonly used reprojection error in the traditional SLAM framework.
For data association, the measurements adopted for the statistical tests are the point clouds of objects and their centroids. To verify our hypothesis about the distributions of different measurements, we collect a large amount of data and visualize their distributions in Fig. 6.
Fig. 6 (a) shows the distributions of the point clouds of 13 objects during the data association in the fr3_long_office sequence, which clearly do not follow a Gaussian distribution. It can be seen that the distributions are related to the characteristics of the objects and do not show consistent behavior. Fig. 6 (b) shows the error distribution of object centroids in different frames, which typically follows a Gaussian distribution. This result supports applying the nonparametric Wilcoxon Rank-Sum test to point clouds and the t-test to object centroids.
We compare our method with the commonly used IoU method, the nonparametric test (NP), and the t-test. Fig. 7 shows the association results of these methods in the fr3_long_office sequence. It can be seen that some objects are not correctly associated in (a)-(c). Due to the lack of association information, existing objects are often misrecognized as new ones by these methods once the objects are occluded or disappear in some frames. In contrast, our method is much more robust and can effectively address this problem (see Fig. 7(d)). The results of other sequences are shown in Table I, and we use the same evaluation metric as [7], which measures the number of objects finally present in the map. GT represents the ground-truth number. As we can see, our method achieves a high association success rate, so fewer objects are present in the map, which demonstrates its effectiveness.
We also compare our method with the state-of-the-art [7], and the results are shown in Table II. As indicated, our method significantly outperforms [7]. Especially on the TUM dataset, the number of successfully associated objects by our method is almost twice that of [7]. On Microsoft RGB-D and Scenes V2, the advantage is less obvious since the number of objects there is limited. The inaccurate association of [7] has two causes: 1) A clustering algorithm is used to tackle the problem mentioned above, which removes most of the candidate objects; 2) The method does not exploit different statistics, so the candidates are not accurately associated.
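Table I: Number of objects finally present in the map for each association method (GT: ground truth).
Sequence   | IoU | IoU+NP | IoU+t-test | EAO | GT
Fr1_desk   | 62  | 47     | 41         | 14  | 16
Fr2_desk   | 83  | 64     | 52         | 22  | 25
Fr3_office | 150 | 128    | 130        | 42  | 45
Fr3_teddy  | 32  | 17     | 21         | 6   | 7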
We superimpose the cubes and quadrics of objects on semi-dense maps for qualitative evaluation. Fig. 8 presents the pose estimation results of the objects in 14 sequences of the three datasets, in which the objects are placed randomly and in different directions. As shown, the proposed method achieves promising results with a monocular camera, which demonstrates the effectiveness of our pose estimation algorithm. Since the datasets are not specially designed for object pose estimation, there is no ground truth for quantitatively evaluating the methods. Here, we compare the results before initialization (BI), after initialization (AI), and after joint optimization (JO). As shown in Table III, the original direction of the object is parallel to the global frame, and there is a large angle error. After pose initialization, the error decreases, and after the joint optimization, the error is further reduced, which verifies the effectiveness of our pose estimation algorithm.
Lastly, we build the object-oriented semantic maps based on the robust data association algorithm, the accurate object pose estimation algorithm, and a semi-dense mapping system. Fig. 9 shows two examples, TUM fr3_long_office and fr2_desk, where (d) and (e) show a semi-dense semantic map and an object-oriented map built by EAO-SLAM. Compared with the sparse map of ORB-SLAM2, our maps can express the environment much better. Moreover, the object-oriented map shows superior performance in environment understanding compared with the semi-dense map proposed in [6].
The mapping results of other sequences in the TUM, Microsoft RGB-D, and Scenes V2 datasets are shown in Fig. 10. It can be seen that EAO-SLAM can process multiple classes of objects with different scales and orientations in complex environments. Inevitably, there are some inaccurate estimations. For instance, in the fire sequence, the chair is too large to be well observed by the fast-moving camera, thus yielding an inaccurate estimation. We also conduct an experiment in a real scenario (Fig. 11). It can be seen that even when the objects are occluded, they can be accurately estimated, which further verifies the robustness and accuracy of our system.
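Table II: Number of successfully associated objects per sequence.
Dataset         | Sequence        | [20] | Ours | Truth
TUM             | fr1_desk        | -    | 14   | 16
TUM             | fr2_desk        | 11   | 22   | 23
TUM             | fr3_long_office | 15   | 42   | 45
TUM             | fr3_teddy       | 2    | 6    | 7
Microsoft RGB-D | Chess           | 5    | 13   | 16
Microsoft RGB-D | Fire            | 4    | 6    | 6
Microsoft RGB-D | Office          | 10   | 21   | 27
Microsoft RGB-D | Pumpkin         | 4    | 6    | 6
Microsoft RGB-D | Heads           | -    | 15   | 18
Scenes V2       | 01              | 5    | 7    | 8
Scenes V2       | 07              | -    | 7    | 7
Scenes V2       | 10              | 6    | 7    | 7
Scenes V2       | 13              | 3    | 3    | 3
Scenes V2       | 14              | 4    | 5    | 6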
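Table III: Object angle error before initialization (BI), after initialization (AI), and after joint optimization (JO).
Sequence        | Object     | BI   | AI   | JO
fr3_long_office | book1      | 19.2 | 5.3  | 3.1
fr3_long_office | book2      | 11.4 | 5.5  | 4.3
fr3_long_office | book3      | 16.2 | 6.2  | 5.7
fr3_long_office | keyboard1  | 10.3 | 7.2  | 2.5
fr3_long_office | keyboard2  | 7.4  | 4.2  | 2.8
fr3_long_office | mouse      | 11.3 | 6.4  | 4.3
fr1_desk        | Book1      | 33.5 | 8.6  | 5.4
fr1_desk        | Book2      | 15.2 | 8.9  | 7.6
fr1_desk        | Tvmonitor1 | 32.7 | 6.0  | 8.7
fr1_desk        | Tvmonitor2 | 22.5 | 11.4 | 10.2
fr1_desk        | keyboard   | 8.9  | 5.5  | 3.9
fr2_desk        | Book1      | 15.5 | 3.8  | 5.1
fr2_desk        | Book2      | 16.9 | 10.1 | 6.4
fr2_desk        | mouse      | 8.7  | 7.5  | 7.9
Mean            |            | 16.4 | 6.9  | 5.6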
In this paper, we present the EAO-SLAM system, which aims to build semi-dense or lightweight object-oriented maps. The system is implemented based on a robust ensemble data association method and an accurate pose estimation framework. Extensive experiments show that our proposed algorithms and SLAM system can build accurate object-oriented maps with object poses and scales accurately registered. The methodologies presented in this work further push the limits of semantic SLAM and will facilitate related research on robot navigation, mobile manipulation, and human-robot interaction.
2018 24th International Conference on Pattern Recognition (ICPR), pp. 1658–1663.
Localization of classified objects in SLAM using nonparametric statistics and clustering. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 161–168.
Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), pp. 1–39.
ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.