EAO-SLAM: Monocular Semi-Dense Object SLAM Based on Ensemble Data Association

04/27/2020 ∙ by Yanmin Wu, et al. ∙ Northeastern University

Object-level data association and pose estimation play a fundamental role in semantic SLAM, yet they remain unsolved due to the lack of robust and accurate algorithms. In this work, we propose an ensemble data association strategy that integrates parametric and nonparametric statistical tests. By exploiting the nature of different statistics, our method can effectively aggregate the information of different measurements and thus significantly improve the robustness and accuracy of the association process. We then present an accurate object pose estimation framework, in which an outlier-robust centroid and scale estimation algorithm and an object pose initialization algorithm are developed to help improve the optimality of the estimated results. Furthermore, we build a SLAM system that can generate semi-dense or lightweight object-oriented maps with a monocular camera. Extensive experiments are conducted on three publicly available datasets and in a real scenario. The results show that our approach significantly outperforms state-of-the-art techniques in accuracy and robustness.



There are no comments yet.


page 1

page 3

page 6

page 8

Code Repositories


Cube SLAM 个人注释版

view repo


Ensemble Data Association for Monocular Object SLAM

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Conventional visual SLAM systems have achieved significant success in robot localization and mapping tasks. In recent years, more attention has been devoted to enabling SLAM to serve robot navigation, object manipulation, environment understanding, and other high-level needs. Along with the development of object recognition techniques, semantic SLAM provides a promising solution for these applications.

Semantic SLAM leverages segmentation methods to label map elements and focuses on an overall expression of the environment. Typical approaches usually depend on RGB-D cameras and require a large amount of storage for map representation. As a specific branch of semantic SLAM, object SLAM aims to leverage the semantic information of objects to improve the robustness and accuracy of camera pose estimation, or the ability of semantic environment reconstruction, which exhibits significant advantages in many applications [5, 16, 20] and has drawn much attention from the community. In this work, we further extend the scope of object SLAM by enabling it to build lightweight and object-oriented maps, as demonstrated in Fig. 1, in which the objects are represented by cubes or quadrics with their locations, poses, and scales accurately registered. The challenges of object SLAM are twofold: 1) Existing data association methods [22, 11, 7] are not sufficiently robust or accurate for tackling complex environments with multiple object instances, and there is no practical solution that systematically addresses this problem; some studies, e.g., [15], simply assume the problem is well-solved. 2) Object pose estimation is not accurate, especially for monocular object SLAM. Although some improvements have been achieved in recent studies [25, 21, 17], a relatively complete observation of the object is typically required, which is difficult to achieve in real-world applications.

Fig. 1: A lightweight and object-oriented semantic map.

In this paper, we propose EAO-SLAM, a monocular object SLAM system that effectively addresses the data association and pose estimation problems. First, leveraging the different measurements in SLAM, we integrate parametric and nonparametric statistical tests, as well as the traditional IoU-based method, to conduct model ensembling for data association. Compared with conventional methods, our approach fully exploits the nature of different statistics and exhibits significant advantages: 1) The association is based on an ensemble of multiple models, which can tackle most cases robustly; 2) We additionally introduce a new test, the double-sample t-test, to assist the association process, which can effectively handle challenging cases and improve the overall association accuracy. In terms of object pose estimation, we propose a centroid and scale estimation algorithm and an object pose initialization algorithm, based on the isolation forest (iForest) and several proposed score functions, to improve the optimality of the estimation results. The algorithms are robust to outliers and highly accurate, which significantly facilitates the joint optimization of object and camera poses.

The contributions of this paper are summarized as follows:

  • We propose an ensemble data association strategy that can effectively aggregate different measurements of the objects to improve association accuracy.

  • We propose an object pose estimation framework based on iForest, which is robust to outliers and can accurately estimate the locations, poses, and scales of objects.

  • Based on the proposed methods, we implement EAO-SLAM to build lightweight and object-oriented maps.

  • We conduct comprehensive experiments and verify the effectiveness of our proposed methods on three publicly available datasets. The source code is also released at https://github.com/yanmin-wu/EAO-SLAM.

II Related Work

II-A Data Association

Data association is an indispensable ingredient of semantic SLAM: it determines whether an object observed in the current frame corresponds to an existing object in the map. Bowman et al. [1] use a probabilistic method to model the data association process and leverage the EM algorithm to find correspondences between observed landmarks. Subsequent studies [22, 24] further extend the idea to associate dynamic objects or conduct semantic dense reconstruction. These methods can achieve high association accuracy, but can only process a limited number of object instances, and their efficiency remains limited due to the expensive EM optimization process.

Object tracking is another commonly used approach for data association. For instance, the Kalman filter is used to associate vehicles by predicting their locations in consecutive frames [26, 3]. Li et al. [9] project 3D cubes to the image plane and then leverage the Hungarian tracking algorithm to conduct association using the projected 2D bounding boxes. Tracking-based methods offer high runtime efficiency, but easily generate incorrect priors in complex environments, yielding incorrect association results.

In recent studies, more data association approaches have been developed based on maximum shared information. Liu et al. [11] propose random walk descriptors to represent the topological relationships between objects; those with the maximum number of shared descriptors are regarded as the same instance. Instead, Yang et al. [25] propose to directly count the number of matched map points on the detected objects as the association criterion, yielding much more efficient performance. Grinvald et al. [5] propose to measure the similarity between semantic labels, and Ok et al. [16] propose to leverage the correlation of hue-saturation histograms. The major drawback of these methods is that the designed features or descriptors are typically not general or robust enough and can easily cause incorrect associations.

Mu et al. [12] first propose nonparametric statistical testing for semantic data association, which can address problems where the statistics do not follow a Gaussian distribution. Later on, Iqbal et al. [7] also verify the effectiveness of nonparametric data association. However, such methods cannot effectively handle statistics that do follow Gaussian distributions, and thus cannot sufficiently aggregate the different measurements in SLAM. Based on this observation, we combine parametric and nonparametric methods to perform model ensembling, which exhibits superior association performance in complex scenarios with multiple categories of objects.

II-B Object SLAM

The integration of objects significantly enlarges the application range of traditional SLAM. Some studies [15, 12, 8] treat objects as landmarks to estimate camera poses, or use objects for relocalization [9]. Others [3] leverage object size to constrain the scale of monocular SLAM, or remove dynamic objects to improve pose estimation accuracy [22, 18]. In recent years, the combination of object SLAM and grasping [14, 19] has also attracted much interest and facilitates research on autonomous mobile manipulation.

Object models in semantic SLAM can be broadly divided into three categories: instance-level models, category-specific models, and general models. Instance-level models [21, 2] depend on a well-established database that records all the related objects. The prior information of objects provides important object-camera constraints for graph optimization; however, since the models need to be known in advance, the application scenarios of such methods are limited. There are also studies on category-specific models, which focus on describing category-level features. For example, Parkhiya et al. [17] and Joshi et al. [8] use a CNN to estimate the viewpoint of an object and then project its 3D line segments onto the image plane to align them. General models adopt simple geometric elements, e.g., cubes [25, 9], quadrics [15, 4], and cylinders [17], to represent objects, and are the most commonly used.

In terms of the joint optimization of camera and object poses, Frost et al. [3] simply integrate object centroids as point clouds into the camera pose estimation process. Yang et al. [25] propose a joint camera-object-point optimization scheme to construct pose and scale constraints for graph optimization. Nicholson et al. [15] propose to project the quadric onto the image plane and then calculate the scale error between the projected 2D rectangle and the detected bounding box. This work also adopts the joint optimization strategy, but with a novel initialization method that significantly improves the optimality of the solutions.

III System Overview

The proposed object SLAM framework is demonstrated in Fig. 2. It is developed based on ORB-SLAM2 [13] and additionally integrates a semantic thread that adopts YOLOv3 as the object detector. The ensemble data association is implemented in the tracking thread, which combines the information of bounding boxes, semantic labels, and point clouds. After that, the iForest is leveraged to eliminate outliers and find an accurate initialization for the joint optimization. The object pose and scale are then optimized together with the camera pose to build a lightweight and object-oriented map. Lastly, this map is combined with the semi-dense map to obtain the final semi-dense semantic map.

Fig. 2: The architecture of the EAO-SLAM system. Our main contributions are highlighted in red.

IV Ensemble Data Association

Throughout this section, the following notations are used:

  • P - the point clouds of objects.

  • r_i - the rank (position) of a data point in a sorted list.

  • c - the currently observed object centroid.

  • C = {c_1, …, c_n} - the history observations of the centroids of an object.

  • p(·) - the probability function used for statistical tests.

  • μ(·), σ²(·) - the mean and variance functions.

Fig. 3: Different statistics in data association.

IV-A Nonparametric Test

The nonparametric test is leveraged to process the point cloud of an object (the green points in Fig. 3 (a)), which follows a non-Gaussian distribution according to our experimental studies (see Section VI-A). Theoretically, if two point clouds P_1 and P_2 belong to the same object, they should follow the same distribution. We use the Wilcoxon Rank-Sum test [23] to verify whether this hypothesis holds.

We first concatenate the two point clouds, P = P_1 ∪ P_2, and then sort P along each of the three dimensions respectively. Define the rank sum W_1 as follows,

W_1 = Σ_{i=1}^{m} r_i,    (1)

where r_i is the rank of the i-th point of P_1 in the sorted P, and define W_2 in the same way. Either rank sum can serve as the Mann-Whitney statistic; taking W_1, it can be proved to follow a Gaussian distribution asymptotically [23]. Herein, we actually construct a Gaussian statistic from the non-Gaussian point clouds. The mean and variance of W_1 under the null hypothesis can be calculated as follows:

μ(W_1) = m(m + n + 1) / 2,    (2)
σ²(W_1) = mn(m + n + 1) / 12,    (3)

where m = |P_1| and n = |P_2|.

To make the hypothesis stand, W_1 should meet the following constraint:

μ(W_1) − z_{α/2} σ(W_1) < W_1 < μ(W_1) + z_{α/2} σ(W_1),    (4)

where α is the significance level, 1 − α is thus the confidence level, and the interval in Eq. (4) defines the confidence region. The scalar z_{α/2} is defined on a normalized Gaussian distribution N(0, 1). In summary, if the Mann-Whitney statistic of two point clouds satisfies Eq. (4), the two clouds come from the same object and the data association succeeds.
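As a concrete illustration, the rank-sum check above can be sketched in a few lines of Python. The function name and the fixed two-sided 95% critical value z = 1.96 are illustrative choices, not the paper's implementation, and the sketch tests a single coordinate axis at a time:

```python
import math

def rank_sum_association(p1, p2, z_crit=1.96):
    """Check one coordinate of two point clouds with the Wilcoxon
    rank-sum normal approximation: associate the clouds when the
    rank-sum statistic stays inside the confidence region (Eq. (4))."""
    m, n = len(p1), len(p2)
    pooled = sorted((v, src) for src, cloud in enumerate((p1, p2))
                    for v in cloud)
    # ranks are 1-based positions in the sorted concatenation
    w1 = sum(r + 1 for r, (v, src) in enumerate(pooled) if src == 0)
    mu = m * (m + n + 1) / 2.0                      # mean under H0
    sigma = math.sqrt(m * n * (m + n + 1) / 12.0)   # std under H0
    return abs(w1 - mu) < z_crit * sigma
```

In practice the check would be applied to each of the three coordinates, and ties would need a rank correction; both are omitted here for brevity.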

IV-B Single-Sample and Double-Sample t-Test

The single-sample t-test is used to process the object centroids observed in different frames (the stars in Fig. 3 (b)), which typically follow a Gaussian distribution (see Section VI-A).

Suppose the current centroid c and the history observations C = {c_1, …, c_n} are from the same object; define the statistic t as follows,

t = (c − μ(C)) / (σ(C) / √n) ~ t(n − 1).    (5)

To make the hypothesis stand, t should satisfy:

|t| < t_{α/2}(n − 1),    (6)

where t_{α/2}(n − 1) is the upper α/2 quantile of the t-distribution with n − 1 degrees of freedom, and α is the significance level. If the statistic satisfies Eq. (6), c and C come from the same object.
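A minimal sketch of this centroid check on one coordinate, assuming the statistic is formed from the history mean and standard deviation as described above (the function names and the hard-coded critical value for ten observations are illustrative):

```python
import math

def t_statistic(history, current):
    """t = (c - mean(C)) / (std(C) / sqrt(n)) for one coordinate of a
    centroid; `history` holds the past observations C."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / (n - 1)  # unbiased
    return (current - mean) / math.sqrt(var / n)

def associate_centroid(history, current, t_crit=2.262):
    # t_crit: two-sided 95% quantile of t(9), matching n = 10 observations
    return abs(t_statistic(history, current)) < t_crit
```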

Due to the strict data association strategy above or unfavorable viewing angles, some existing objects may be recognized as new ones. Hence, a double-sample t-test is leveraged to determine whether to merge two objects by testing their history object centroids (the stars in Fig. 3 (c)).

Construct the t-statistic for the centroid histories C_1 and C_2 as follows,

t = (μ(C_1) − μ(C_2)) / (s_p √(1/n_1 + 1/n_2)) ~ t(n_1 + n_2 − 2),    (7)

s_p = √( ((n_1 − 1)σ²(C_1) + (n_2 − 1)σ²(C_2)) / (n_1 + n_2 − 2) ),    (8)

where s_p is the pooled standard deviation of the two objects. Similarly, if t satisfies (6) with n_1 + n_2 − 2 degrees of freedom, C_1 and C_2 belong to the same object, and we merge them.
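The merge test can be sketched the same way; again the names are illustrative, and the statistic is computed per coordinate under a pooled-variance assumption:

```python
import math

def two_sample_t(c1, c2):
    """Pooled two-sample t-statistic for one coordinate of two centroid
    histories; a small |t| suggests the two candidates should be merged."""
    n1, n2 = len(c1), len(c2)
    m1, m2 = sum(c1) / n1, sum(c2) / n2
    v1 = sum((x - m1) ** 2 for x in c1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in c2) / (n2 - 1)
    # pooled standard deviation of the two samples
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / (sp * math.sqrt(1.0 / n1 + 1.0 / n2))
```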

V Object SLAM

Throughout this section, the following notations are used:

  • t_wo - the translation (location) of the object frame in the world frame.

  • θ - the rotation of the object frame w.r.t. the world frame; R(θ) is its matrix representation.

  • T_wo - the transformation of the object frame w.r.t. the world frame.

  • s - half of the side lengths of a 3D bounding box, i.e., the scale of an object.

  • P_o, P_w - the coordinates of the eight vertices of a cube in the object and world frames, respectively.

  • Q_o, Q_w - the quadric parameterized by its semi-axes in the object and world frames, respectively, where the semi-axes are given by the scale s.

  • ∠(l) - the angle of line segment l.

  • K, T_cw - the intrinsic and extrinsic parameters of the camera.

  • p_w - the coordinates of a point in the world frame.

1: Input: P - the point cloud of an object; t - the number of iTrees in the iForest; ψ - the subsampling size for an iTree.
2: Output: F - the iForest, a set of iTrees; t_wo - the origin of the local frame; s - the initial scale of the object.
3: procedure paraObject(P, t, ψ)
4:     F ← buildForest(P, t, ψ)
5:     for point p in P do
6:         h(p) ← averageDepth(p, F)
7:         s(p) ← score(h(p), ψ)    # Eq. (11) and (12)
8:         if s(p) > 0.6 then    # an empirical value
9:             remove(p, P)    # remove p from P
10:        end if
11:    end for
12:    t_wo ← meanValue(P)
13:    s ← (max(P) - min(P)) / 2
14:    return F, t_wo, s
15: end procedure
16: procedure buildForest(P, t, ψ)
17:    l ← ceiling(log2(ψ))    # height limit of an iTree
18:    for i ← 1 to t do
19:        P' ← randomSample(P, ψ)
20:        F ← F ∪ buildTree(P', 0, l)
21:    end for
22:    return F
23: end procedure
24: procedure buildTree(P, e, l)
25:    if e ≥ l or |P| ≤ 1 then
26:        return exNode{|P|}    # record the size of P
27:    end if
28:    q ← randomDim(1, 3)    # get one dimension
29:    p ← randomSplitPoint(P, q)
30:    (P_l, P_r) ← split(P, q, p)
31:    L ← buildTree(P_l, e + 1, l)    # get child pointers
32:    R ← buildTree(P_r, e + 1, l)
33:    return inNode{L, R, q, p}
34: end procedure
Algorithm 1 Centroid and Scale Estimation Based on iForest
Fig. 4: Object representation and iForest demonstration.

Object Representation: In this work, we leverage cubes and quadrics to represent objects, rather than complex instance-level or category-level models. For objects with regular shapes, such as books, keyboards, and chairs, we use cubes (encoded by their vertices P_o). For objects without an explicit direction, such as balls, bottles, and cups, the quadric (encoded by its semi-axes, Q_o) is used. Here, P_o and Q_o are expressed in the object frame and only depend on the scale parameter s. To register these elements to the global map, we also need to estimate their translation and orientation w.r.t. the global frame. The cubes and quadrics in the global frame are expressed as follows, respectively:

P_w = R(θ) P_o + t_wo,    (9)
Q_w = T_wo Q_o T_wo^T.    (10)

With the assumption that the objects are placed parallel to the ground, i.e., roll and pitch are zero, we only need to estimate [t_wo, θ, s] for a cube and [t_wo, s] for a quadric.
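As a small worked example of this yaw-only parameterization, the following sketch (with hypothetical names) maps the eight cube vertices from the object frame into the world frame using only the translation, the yaw angle, and the half side lengths:

```python
import math

def cube_vertices_world(t_wo, yaw, s):
    """Map the 8 vertices of a cube with half side lengths s = (sx, sy, sz)
    from the object frame to the world frame with a yaw-only rotation."""
    c, si = math.cos(yaw), math.sin(yaw)
    verts = []
    for dx in (-1, 1):
        for dy in (-1, 1):
            for dz in (-1, 1):
                x, y, z = dx * s[0], dy * s[1], dz * s[2]
                # p_w = R(yaw) * p_o + t_wo; z is unaffected by yaw
                verts.append((c * x - si * y + t_wo[0],
                              si * x + c * y + t_wo[1],
                              z + t_wo[2]))
    return verts
```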

Estimate t_wo and s: Given the point cloud P of an object in the global frame, we follow conventions and take its mean value as t_wo, based on which the scale can be calculated as s = (max(P) − min(P)) / 2. The main challenge here is that P typically contains many outliers, which can introduce a large bias into the estimation of t_wo and s. One of our major contributions in this paper is the development of an outlier-robust centroid and scale estimation algorithm based on the iForest [10] to improve the estimation accuracy. The detailed procedure is presented in Alg. 1.

The key idea of the algorithm is to recursively partition the data space into a series of isolated data points and take the easily isolated ones as outliers. The philosophy is that normal points are typically located closely together and thus need more steps to isolate, whereas outliers scatter sparsely and can be isolated in fewer steps. As indicated by the algorithm, we first create the isolation trees (the iForest) using the point cloud of an object (procedure buildForest), and then identify the outliers by scoring the average path length of each point (the loop in paraObject), in which the score function is defined as follows:

s(p) = 2^(−E(h(p)) / c(ψ)),    (11)
c(ψ) = 2H(ψ − 1) − 2(ψ − 1)/ψ,    (12)

where E(h(p)) is the average path length of point p over the iTrees, c(ψ) is a normalization parameter, and H(·) denotes the harmonic number. As demonstrated in Fig. 4(d)-(e), the yellow point is isolated after four steps, so its path length is 4, while the green point has a path length of 8; the yellow point is therefore more likely to be an outlier. In our implementation, points with a score greater than 0.6 are removed, and the remaining points are used to calculate t_wo and s. Based on s, we can initially construct the cubes and quadrics in the object frame, as shown in Fig. 4(a)-(c). They will be further optimized along with the object and camera poses later on.
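The whole estimation step can be sketched as a compact, self-contained iForest in Python. The tree count, subsampling size, and the 0.6 threshold follow the text, while the helper names and random-split details are illustrative rather than the authors' code:

```python
import math
import random

def _build_tree(pts, depth, limit, rng):
    """Grow one iTree by recursive random axis-aligned splits."""
    if depth >= limit or len(pts) <= 1:
        return ('leaf', len(pts))
    dim = rng.randrange(3)                       # randomDim(1, 3)
    lo = min(p[dim] for p in pts)
    hi = max(p[dim] for p in pts)
    if lo == hi:
        return ('leaf', len(pts))
    split = rng.uniform(lo, hi)                  # randomSplitPoint
    left = [p for p in pts if p[dim] < split]
    right = [p for p in pts if p[dim] >= split]
    return ('node', dim, split,
            _build_tree(left, depth + 1, limit, rng),
            _build_tree(right, depth + 1, limit, rng))

def _path_length(tree, p, depth=0):
    """Depth at which p is isolated, with the usual unresolved-leaf term."""
    if tree[0] == 'leaf':
        n = tree[1]
        if n <= 1:
            return depth
        return depth + 2 * (math.log(n - 1) + 0.5772) - 2 * (n - 1) / n
    _, dim, split, left, right = tree
    return _path_length(left if p[dim] < split else right, p, depth + 1)

def estimate_centroid_scale(cloud, n_trees=50, subsample=64, thresh=0.6, seed=0):
    """Drop iForest outliers, then centroid = mean, scale = half extent."""
    rng = random.Random(seed)
    psi = min(subsample, len(cloud))
    limit = math.ceil(math.log2(psi))            # ceiling(log2 psi)
    trees = [_build_tree(rng.sample(cloud, psi), 0, limit, rng)
             for _ in range(n_trees)]
    c_psi = 2 * (math.log(psi - 1) + 0.5772) - 2 * (psi - 1) / psi
    inliers = []
    for p in cloud:
        e_h = sum(_path_length(t, p) for t in trees) / n_trees
        if 2 ** (-e_h / c_psi) <= thresh:        # anomaly score, Eq. (11)-(12)
            inliers.append(p)
    n = len(inliers)
    t_wo = tuple(sum(p[d] for p in inliers) / n for d in range(3))
    s = tuple((max(p[d] for p in inliers) - min(p[d] for p in inliers)) / 2
              for d in range(3))
    return t_wo, s, inliers
```

A far-away point is isolated within one or two splits and thus scores well above 0.6, while points inside a dense cluster need roughly c(ψ) steps and score near 0.5.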

Estimate θ: The estimation of θ is divided into two steps: first find a good initial value, and then conduct numerical optimization based on it. Since pose estimation is a non-linear process, a good initialization is very important for improving the optimality of the estimation result. Conventional methods [9, 20] usually neglect the initialization process, which typically yields inaccurate results.

The details of the pose initialization algorithm are presented in Alg. 2. The inputs are obtained as follows: 1) LSD segments are extracted from consecutive images, and those falling in the bounding boxes are assigned to the corresponding objects (see Fig. 5a); 2) The initial pose of an object is assumed to be consistent with the global frame (see Fig. 5b). In the algorithm, we first uniformly sample thirty angles around the initial guess. For each sample, we then evaluate its score by calculating the accumulated angle errors between the LSD segments and the projected edges of the cube. The error is defined as follows:

e_i = ||∠(l_i) − ∠(l̂_i)||,    (13)

where l_i is a detected LSD segment and l̂_i is its nearest projected cube edge. A demonstration of the calculation of e_i is visualized in Fig. 5 (e)-(g). The score function is defined as follows:

s(θ) = m / (n · ē),    (14)

where n is the total number of line segments of the object in the current frame, m is the number of line segments that satisfy e_i < e_th, e_th is a manually defined error threshold (five degrees here), and ē is the average error of the line segments with e_i < e_th. After evaluating all the samples, we choose the one that achieves the highest score as the initial yaw angle for the subsequent optimization.
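The sampling-and-scoring loop can be sketched as follows; `project_edge_angle` is a hypothetical callback standing in for the cube edge projection, and the 90-degree sampling range and exact score form are illustrative assumptions rather than the paper's definition:

```python
import math

def init_yaw(segment_angles, project_edge_angle, n_samples=30,
             err_th=math.radians(5.0)):
    """Sample candidate yaw angles and keep the one whose projected edge
    direction best matches the detected LSD segment angles."""
    best_yaw, best_score = 0.0, -1.0
    for k in range(n_samples):
        yaw = -math.pi / 4 + k * (math.pi / 2) / (n_samples - 1)
        edge = project_edge_angle(yaw)          # hypothetical projection
        errs = [abs(a - edge) for a in segment_angles]
        good = [e for e in errs if e < err_th]  # segments within threshold
        if good:
            # more aligned segments and a smaller mean error -> higher score
            score = (len(good) / len(errs)) / (sum(good) / len(good) + 1e-6)
        else:
            score = 0.0
        if score > best_score:
            best_yaw, best_score = yaw, score
    return best_yaw
```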

1: Input: L - line segments detected by LSD in consecutive images; θ_0 - the initial guess of the yaw angle.
2: Output: θ* - the estimated yaw angle; e - the estimation errors.
3: Θ ← sampleAngles(θ_0, 30)    # see Fig. 5 (b)-(d)
4: for θ in Θ do
5:     for l in L do
6:         s(θ) ← s(θ) + score(l, θ)    # Eq. (13) and (14)
7:     end for
8: end for
9: θ* ← argmax_θ s(θ)
10: return θ*, e
Algorithm 2 Initialization for Object Pose Estimation
Fig. 5: Line alignment to estimate object direction.

Joint Optimization: After obtaining the initial θ and s, we jointly optimize the object and camera poses as follows:

min over {T_cw, θ, s}:  Σ_objects (e_θ + e_s) + Σ_points || u − π(K, T_cw, p_w) ||²,    (15)

where the first term contains the object pose error e_θ defined in Eq. (13) and the scale error e_s, defined as the distance between the projected edges of a cube and their nearest parallel LSD segments. The second term is the commonly-used reprojection error in traditional SLAM frameworks, with π(·) the camera projection function and u the observed image point.
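To make the idea concrete, here is a toy sketch of refining yaw and scale by descending the sum of two error terms with numerical gradients; EAO-SLAM itself performs graph optimization over camera and object poses, so this is only an illustration of combining the objectives, with all names hypothetical:

```python
def joint_refine(yaw0, scale0, obj_err, rep_err, lr=0.01, iters=200):
    """Descend obj_err(yaw, s) + rep_err(yaw, s) with central-difference
    gradients; both error terms are supplied as callables."""
    def cost(y, sc):
        return obj_err(y, sc) + rep_err(y, sc)
    yaw, s = yaw0, scale0
    eps = 1e-4
    for _ in range(iters):
        g_yaw = (cost(yaw + eps, s) - cost(yaw - eps, s)) / (2 * eps)
        g_s = (cost(yaw, s + eps) - cost(yaw, s - eps)) / (2 * eps)
        yaw -= lr * g_yaw
        s -= lr * g_s
    return yaw, s
```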

VI Experimental Results

VI-A Distributions of Different Statistics

For data association, the adopted measurements for statistical tests include the point clouds of objects and their centroids. To verify our hypotheses about the distributions of these measurements, we collect a large amount of data and visualize the distributions in Fig. 6.

Fig. 6: Distributions of the measurements. (a): position distribution of point clouds in three directions. (b): distance error distribution of centroids.

Fig. 6 (a) shows the distributions of the point clouds of 13 objects during data association in the fr3_long_office sequence, which obviously do not follow a Gaussian distribution. It can be seen that the distributions are related to the characteristics of the objects and do not show consistent behaviors. Fig. 6 (b) shows the error distribution of object centroids in different frames, which typically follows a Gaussian distribution. This result verifies the validity of applying the nonparametric Wilcoxon Rank-Sum test to point clouds and the t-test to object centroids.

VI-B Ensemble Data Association Experiments

We compare our method with the commonly used IoU method, the nonparametric test (NP), and the t-test. Fig. 7 shows the association results of these methods on the fr3_long_office sequence. It can be seen that some objects are not correctly associated in (a)-(c). Due to the lack of association information, existing objects are often misrecognized as new ones by these methods once the objects are occluded or disappear in some frames. In contrast, our method is much more robust and can effectively address this problem (see Fig. 7(d)). The results on other sequences are shown in Table I, where we use the same evaluation metric as [7], i.e., the number of objects that finally appear in the map; GT denotes the ground-truth number. As we can see, our method achieves a high association success rate, and thus fewer spurious objects appear in the map, which clearly demonstrates its effectiveness.

We also compare our method with the state of the art [7], and the results are shown in Table II. As indicated, our method significantly outperforms [7]. Especially on the TUM dataset, the number of successfully associated objects by our method is almost twice that of [7]. On Microsoft RGBD and Scenes V2, the advantage is less obvious since the number of objects there is limited. The inaccurate association of [7] has two causes: 1) A clustering algorithm is leveraged to tackle the problem mentioned above, which removes most of the candidate objects; 2) The method does not exploit different statistics, so the candidates are not accurately associated.

Fig. 7: Qualitative comparison of data association results. (a): IoU method. (b): IoU and nonparametric test. (c): IoU and t-test. (d): our ensemble method.
Table I: Number of objects remaining in the final map under different association strategies (GT: ground truth).

Sequence     IoU   IoU+NP   IoU+t-test   EAO (ours)   GT
Fr1_desk      62       47           41           14   16
Fr2_desk      83       64           52           22   25
Fr3_office   150      128          130           42   45
Fr3_teddy     32       17           21            6    7

VI-C Qualitative Assessment of Object Pose Estimation

Fig. 8: Results of object pose estimation. Odd columns: original RGB images. Even columns: estimated object poses.

We superimpose the cubes and quadrics of objects on semi-dense maps for qualitative evaluation. Fig. 8 presents the pose estimation results of objects in 14 sequences of the three datasets, in which the objects are placed randomly and in different directions. As shown, the proposed method achieves promising results with a monocular camera, which demonstrates the effectiveness of our pose estimation algorithm. Since the datasets are not specially designed for object pose estimation, there is no ground truth for quantitatively evaluating the methods. Here, we compare the angle errors before initialization (BI), after initialization (AI), and after joint optimization (JO). As shown in Table III, the original direction of the object is parallel to the global frame, resulting in a large angle error. After pose initialization, the error decreases, and after joint optimization it is further reduced, which verifies the effectiveness of our pose estimation algorithm.

VI-D Object-Oriented Map Building

Lastly, we build object-oriented semantic maps based on the robust data association algorithm, the accurate object pose estimation algorithm, and a semi-dense mapping system. Fig. 9 shows two examples from TUM fr3_long_office and fr2_desk, where (d) and (e) show a semi-dense semantic map and an object-oriented map built by EAO-SLAM. Compared with the sparse map of ORB-SLAM2, our maps express the environment much better. Moreover, the object-oriented map shows superior performance in environment understanding compared with the semi-dense map proposed in [6].

The mapping results of other sequences in the TUM, Microsoft RGB-D, and Scenes V2 datasets are shown in Fig. 10. It can be seen that EAO-SLAM can process multiple classes of objects with different scales and orientations in complex environments. Inevitably, there are some inaccurate estimations. For instance, in the fire sequence, the chair is too large to be well observed by the fast-moving camera, yielding an inaccurate estimation. We also conduct experiments in a real scenario (Fig. 11). It can be seen that even when objects are occluded, they can still be accurately estimated, which further verifies the robustness and accuracy of our system.

Table II: Number of correctly associated objects, compared with [7] (GT: ground truth).

           TUM                                           Microsoft RGBD                        Scenes V2
           fr1_desk  fr2_desk  fr3_long_office  fr3_teddy   Chess  Fire  Office  Pumpkin  Heads   01  07  10  13  14
[7]               -        11               15          2       5     4      10        4      -    5   -   6   3   4
Ours             14        22               42          6      13     6      21        6     15    7   7   7   3   5
GT               16        23               45          7      16     6      27        6     18    8   7   7   3   6

Table III: Object angle errors before initialization (BI), after initialization (AI), and after joint optimization (JO).

      fr3_long_office                                        fr1_desk                                      fr2_desk             Mean
      book1  book2  book3  keyboard1  keyboard2  mouse   Book1  Book2  Tvmonitor1  Tvmonitor2  keyboard   Book1  Book2  mouse
BI     19.2   11.4   16.2       10.3        7.4   11.3    33.5   15.2        32.7        22.5       8.9    15.5   16.9    8.7   16.4
AI      5.3    5.5    6.2        7.2        4.2    6.4     8.6    8.9         6.0        11.4       5.5     3.8   10.1    7.5    6.9
JO      3.1    4.3    5.7        2.5        2.8    4.3     5.4    7.6         8.7        10.2       3.9     5.1    6.4    7.9    5.6

Fig. 9: Different map representations. (a): the RGB images. (b): the sparse map. (c): semi-dense map. (d): our semi-dense semantic map. (e): our lightweight and object-oriented map. (d) and (e) are built by the proposed EAO-SLAM.
Fig. 10: Results of EAO-SLAM on the three datasets. Top: raw images. Bottom: semi-dense object-oriented maps.
Fig. 11: Results of EAO-SLAM in a real scenario. Left and right: raw images. Middle: semi-dense object-oriented map.

VII Conclusion

In this paper, we present the EAO-SLAM system, which aims to build semi-dense or lightweight object-oriented maps. The system is implemented based on a robust ensemble data association method and an accurate pose estimation framework. Extensive experiments show that our proposed algorithms and SLAM system can build accurate object-oriented maps with object poses and scales accurately registered. The methodologies presented in this work further push the limits of semantic SLAM and will facilitate related research on robot navigation, mobile manipulation, and human-robot interaction.


  • [1] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas (2017) Probabilistic data association for semantic SLAM. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1722–1729. Cited by: §II-A.
  • [2] S. Choudhary, L. Carlone, C. Nieto, J. Rogers, Z. Liu, H. I. Christensen, and F. Dellaert (2016) Multi robot object-based SLAM. In International Symposium on Experimental Robotics, pp. 729–741. Cited by: §II-B.
  • [3] D. Frost, V. Prisacariu, and D. Murray (2018) Recovering stable scale in monocular SLAM using object-supplemented bundle adjustment. IEEE Transactions on Robotics 34 (3), pp. 736–747. Cited by: §II-A, §II-B.
  • [4] V. Gaudillière, G. Simon, and M. Berger (2019) Camera pose estimation with semantic 3D model. Cited by: §II-B.
  • [5] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart, and J. Nieto (2019) Volumetric instance-aware semantic mapping and 3D object discovery. IEEE Robotics and Automation Letters 4 (3), pp. 3037–3044. Cited by: §I, §II-A.
  • [6] S. He, X. Qin, Z. Zhang, and M. Jagersand (2018) Incremental 3D line segment extraction from semi-dense SLAM. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1658–1663. Cited by: §VI-D.
  • [7] A. Iqbal and N. R. Gans (2018) Localization of classified objects in SLAM using nonparametric statistics and clustering. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 161–168. Cited by: §I, §II-A, §VI-B.
  • [8] N. Joshi, Y. Sharma, P. Parkhiya, R. Khawad, K. M. Krishna, and B. Bhowmick (2019) Integrating objects into monocular SLAM: line based category specific models. arXiv preprint arXiv:1905.04698. Cited by: §II-B.
  • [9] J. Li, D. Meger, and G. Dudek (2019) Semantic mapping for view-invariant relocalization. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7108–7115. Cited by: §II-A, §II-B, §V.
  • [10] F. T. Liu, K. M. Ting, and Z. Zhou (2012) Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), pp. 1–39. Cited by: §V.
  • [11] Y. Liu, Y. Petillot, D. Lane, and S. Wang (2019) Global localization with object-level semantics and topology. In 2019 International Conference on Robotics and Automation (ICRA), pp. 4909–4915. Cited by: §I, §II-A.
  • [12] B. Mu, S. Liu, L. Paull, J. Leonard, and J. P. How (2016) SLAM with objects using a nonparametric pose graph. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4602–4609. Cited by: §II-A, §II-B.
  • [13] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §III.
  • [14] A. K. Nellithimaru and G. A. Kantor (2019) ROLS: robust object-level SLAM for grape counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Cited by: §II-B.
  • [15] L. Nicholson, M. Milford, and N. Sünderhauf (2018) QuadricSLAM: dual quadrics from object detections as landmarks in object-oriented SLAM. IEEE Robotics and Automation Letters 4 (1), pp. 1–8. Cited by: §I, §II-B.
  • [16] K. Ok, K. Liu, K. Frey, J. P. How, and N. Roy (2019) Robust object-based SLAM for high-speed autonomous navigation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 669–675. Cited by: §I, §II-A.
  • [17] P. Parkhiya, R. Khawad, J. K. Murthy, B. Bhowmick, and K. M. Krishna (2018) Constructing category-specific models for monocular object-SLAM. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §I, §II-B.
  • [18] J. Peng, X. Shi, J. Wu, and Z. Xiong (2019) An object-oriented semantic SLAM system towards dynamic environments for mobile manipulation. In 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pp. 199–204. Cited by: §II-B.
  • [19] J. Peng, X. Shi, J. Wu, and Z. Xiong (2019) An object-oriented semantic SLAM system towards dynamic environments for mobile manipulation. In 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), pp. 199–204. Cited by: §II-B.
  • [20] M. Runz, M. Buffier, and L. Agapito (2018) MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects. In 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 10–20. Cited by: §I, §V.
  • [21] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison (2013) SLAM++: simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359. Cited by: §I, §II-B.
  • [22] M. Strecke and J. Stuckler (2019) EM-Fusion: dynamic object-level SLAM with probabilistic data association. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5865–5874. Cited by: §I, §II-A, §II-B.
  • [23] F. Wilcoxon (1992) Individual comparisons by ranking methods. In Breakthroughs in Statistics, pp. 196–202. Cited by: §IV-A.
  • [24] S. Yang, Z. Kuang, Y. Cao, Y. Lai, and S. Hu (2019) Probabilistic projective association and semantic guided relocalization for dense reconstruction. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7130–7136. Cited by: §II-A.
  • [25] S. Yang and S. Scherer (2019) CubeSLAM: monocular 3-D object SLAM. IEEE Transactions on Robotics 35 (4), pp. 925–938. Cited by: §I, §II-A, §II-B.
  • [26] H. Zhang, A. Geiger, and R. Urtasun (2013) Understanding high-level semantics by modeling traffic patterns. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3056–3063. Cited by: §II-A.