Simultaneous Localization and Mapping (SLAM) is a fundamental technique that allows robots to perceive their environment. Whereas classic SLAM methods use only the geometry of the scene, recent object-based SLAM focuses on creating maps that contain both geometry and high-level semantic objects within the environment[15, 12, 19, 20, 18, 11, 10, 5, 8]. This semantically enriched information can help robots with target-oriented tasks such as obstacle avoidance, robust relocalization and human-robot interaction. As deep learning networks[14, 2, 7] have improved the accuracy of semantic information acquisition, object detection and semantic segmentation are increasingly introduced into visual SLAM systems to build semantically enriched maps and enhance the perception ability of robots.
Accurate object representation is a key issue in object-oriented SLAM research; 3D object models, cubic boxes[18, 20, 19] and ellipsoids[11, 10, 5] are among the common representations. Prior works like  and  use the cubic box to represent the object, where the pose of the cubic box can be estimated by vanishing points and rotation sampling. Like the cubic box, the ellipsoid can accurately represent the position, orientation and size of the object, and it has a more concise mathematical representation. In projective geometry, a quadric can be represented by a symmetric matrix, and the compact perspective projection model and the closed surface of the ellipsoid make it well suited as an object landmark.
The accuracy and robustness of current quadric-based SLAM are not ideal, especially in the quadric initialization process, which is limited by the parameter coupling of the direct linear solution method or by the need for point cloud fitting[10, 8]. QuadricSLAM [11] is a recently proposed object-oriented SLAM system that represents objects as quadrics and introduces a dual quadric observation model based on object detection. However, the closed-form constrained dual quadric parameterization and the lack of observation angles under the planar trajectory of a mobile vehicle make the quadric initialization difficult and sensitive to observation noise. In , multiple constraints combining points, surfaces and quadrics are used in the optimization framework, but the prior shape of the object is estimated by deep learning, which incurs a high computational cost and is not robust. In [12], texture plane and shape prior constraints are added to the quadric estimation, which addresses the poor estimation performance when the observation angles change little in road driving scenes. However, the assumption that the texture plane is parallel to the image plane during quadric initialization makes the estimation sensitive to noise.
In addition, the data association methods proposed in prior work such as [11, 13] are typically not robust in outdoor scenes. Dynamic objects in outdoor scenes, such as moving cars and pedestrians, are a challenge for quadric estimation, since false object associations lead to false quadric initialization results.
To solve the aforementioned problems, we propose a robust and accurate quadric landmark initialization based on a method for decoupling of quadric parameters (DQP) and an object data association (ODA) algorithm in outdoor scenes. The robustness of DQP to observation noise is improved by independently estimating the quadric centroid translation and the yaw rotation constraint which is satisfied for autonomous vehicles in road planes in most cases. Then, an ellipsoid with improved accuracy can be obtained by a nonlinear optimizer combining the observation error, the texture plane error and the prior object size. In terms of data association, we propose a multiple-cues algorithm combined with the Hungarian assignment algorithm which improves the robustness of object pose estimation.
We demonstrate the performance of the proposed system both in a simulation environment and on the KITTI Raw Data dataset [6]. The experimental results show that the proposed system is more robust to observation noise than existing methods and improves the accuracy of the estimated object position, orientation and size in the outdoor environment.
The main contributions of this work are:
- To effectively overcome observation noise, we propose an accurate and robust quadric landmark initialization method based on the DQP algorithm, which decouples the translation and rotation of the quadric.
- We propose an ODA algorithm that combines the semantic inlier distribution, Kalman-based motion prediction, and ellipsoidal projection to achieve accurate object data association and object pose estimation.
- Based on the proposed algorithms, we implement a real-time stereo visual SLAM system with accurate and robust ellipsoids representing objects, aiming to build an object-oriented and semantically enhanced map for outdoor navigation.
II System Overview
II-A Mathematical Representation of a Quadric Model
For convenience of description, the notations used in this paper are as follows:
is the world coordinate, is the camera coordinate, is the reference camera coordinate of the object, and is the quadric center frame.
$K$ - The intrinsic matrix of a pinhole camera model.
$T_{cw}$ - The transformation from world frame to camera frame, which is composed of a rotation $R$ and a translation $t$.
$P = K[R \mid t]$ - The camera projection matrix that contains intrinsic and extrinsic camera parameters.
$B$ - The 2D object detection bounding box (BBox).
is the segmentation instance mask, is the detection instance, is the object instance.
represent the detected object instance that is assigned to the object , and represent the class label of the detected instance and object instance respectively.
- The checking of image points that are located in the detection box.
$Q$ - The quadric matrix in 3D space; $Q^{*}$ denotes the dual quadric matrix.
$\pi$ - The 3-D plane in homogeneous coordinates; every plane tangent to the quadric fulfils $\pi^{\top} Q^{*} \pi = 0$.
$q$ - The 9-D vector representing the attributes of the quadric, including axial lengths, translation and rotation.
When a dual quadric $Q^{*}$ is projected onto an image plane, it creates a dual conic $C^{*}$, following the rule $C^{*} = P Q^{*} P^{\top}$, where $P$ is the camera projection matrix. For more specific properties of the quadric, please refer to [11].
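The projection rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `dual_quadric` and `project_to_bbox` are hypothetical, and the dual quadric is built with the common convention $Q^{*} = Z\,\mathrm{diag}(a^2, b^2, c^2, -1)\,Z^{\top}$, which is assumed here. The bounding box is obtained from the image lines $x = \text{const}$ and $y = \text{const}$ that are tangent to the projected dual conic.

```python
import numpy as np

def dual_quadric(axes, R, t):
    """Build a 4x4 dual ellipsoid from semi-axes, rotation R (3x3) and centre t (3,).

    Assumed convention: Q* = Z diag(a^2, b^2, c^2, -1) Z^T with Z = [R t; 0 1].
    """
    Z = np.eye(4)
    Z[:3, :3] = R
    Z[:3, 3] = t
    canonical = np.diag([axes[0] ** 2, axes[1] ** 2, axes[2] ** 2, -1.0])
    return Z @ canonical @ Z.T

def project_to_bbox(Q_dual, P):
    """Project a dual quadric with P (3x4); return (xmin, ymin, xmax, ymax)."""
    C = P @ Q_dual @ P.T  # dual conic, defined up to scale
    # A vertical line l = (1, 0, -u) is tangent to the conic when l^T C l = 0,
    # which is a quadratic in u; the same holds for horizontal lines.
    dx = np.sqrt(C[0, 2] ** 2 - C[0, 0] * C[2, 2])
    dy = np.sqrt(C[1, 2] ** 2 - C[1, 1] * C[2, 2])
    xs = sorted([(C[0, 2] - dx) / C[2, 2], (C[0, 2] + dx) / C[2, 2]])
    ys = sorted([(C[1, 2] - dy) / C[2, 2], (C[1, 2] + dy) / C[2, 2]])
    return xs[0], ys[0], xs[1], ys[1]
```

The sketch assumes the ellipsoid lies entirely in front of the camera, so that both discriminants are positive.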
II-B System Architecture
The proposed system is shown in Fig. 2. We implement our algorithms on the basis of ORB-SLAM3 [3], and a stereo camera is used to obtain the metric scale of the estimated trajectory in the autonomous driving scene, which avoids the scale ambiguity of monocular SLAM. We note, however, that our method can also be used with monocular SLAM. There are two key modules, the visual SLAM module and the detection module. The visual SLAM module consists of parallel threads, including the tracking thread and the local mapping thread. Finally, the camera pose is estimated and a semantically enhanced object map is stored in the map database.
(1) The detection thread uses YOLACT to acquire semantic information from the left images of the stereo pair. Its outputs are object detection BBoxes and instance segmentation masks.
(2) The tracking thread takes images and estimates the camera pose from consecutive frames. Meanwhile, the thread waits for the detection instances and associates them with the existing objects in the object map database or decides whether to create a new object using the ODA algorithm. In addition, if the current frame is a keyframe and an observation satisfies the quadric initialization condition, the DQP algorithm is used for robust and accurate quadric initialization.
(3) The local mapping thread optimizes the map points of keyframes with local bundle adjustment. In addition, when the objects are observed by newly inserted keyframes, the new observation can be added to the object optimizer for nonlinear optimization of the ellipsoidal representation of objects.
(4) The map database stores the final maps, including the geometry information of map points and the object-oriented map with ellipsoids.
III Decoupling of Quadric Parameters Initialization Algorithm
III-A Decoupling of Quadric Central Translation
We present the mathematical analysis of the dual quadric parameters to illustrate the effect of the translation component on the estimation of rotation and shape. The dual form parameters of the ellipsoid can be decomposed by eigen-decomposition in the reference camera coordinates of the object:
where is the diagonal matrix composed of the squares of the quadric axial lengths, and is the quadric centroid translation in the reference camera coordinates. The parameters of the block matrix couple the rotation and translation of the quadric. Since the magnitude of the quadric centroid translation is much larger than that of the rotation and axial-length terms, small errors in the estimated centroid translation have a significant impact on the estimated dual quadric matrix, which is why QuadricSLAM [11] is sensitive to observation noise.
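For reference, the decomposition described above is commonly written as follows in QuadricSLAM-style formulations. The symbols $R$, $t$, $D$ and $Z$ are assumptions chosen here for illustration, since the original notation is not recoverable from the text, so this should be read as a sketch of the standard form and not as a verbatim copy of Eq. 1.

```latex
Q^{*}
= Z \begin{bmatrix} D & 0 \\ 0^{\top} & -1 \end{bmatrix} Z^{\top}
= \begin{bmatrix} R D R^{\top} - t\,t^{\top} & -t \\ -t^{\top} & -1 \end{bmatrix},
\qquad
Z = \begin{bmatrix} R & t \\ 0^{\top} & 1 \end{bmatrix},
\qquad
D = \mathrm{diag}(a^{2}, b^{2}, c^{2}).
```

In this form the upper-left block mixes $R D R^{\top}$ with $t\,t^{\top}$. Because the entries of $t\,t^{\top}$ grow with the squared distance to the object while those of $R D R^{\top}$ are bounded by the squared axial lengths, a small relative error in $t$ can dominate the rotation and shape terms. The last column, in contrast, contains only $-t$.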
We can also see from Eq. 1 that the translation parameters appear independently in the dual form parameters , , . Therefore, we estimate the translation component independently to eliminate the effect of parameter coupling, which is a key aspect of our approach. We triangulate the centers of the 2D detection boxes to obtain the triangulated map point , which is close to the quadric center in outdoor scenes. This assumption is supported by the experiments in Section VI-A. Observations of the detection centers in two or more frames form an overdetermined system of equations to solve ,
where, is the -th element of 2D detection center, is the -th row of the projection matrix .
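The overdetermined system above has the form of a standard linear (DLT) triangulation, which can be sketched as follows. The function name `triangulate_center` and its argument layout are hypothetical; each view contributes two rows built from the detection center $(u, v)$ and the rows of its projection matrix, and the homogeneous solution is taken from the SVD.

```python
import numpy as np

def triangulate_center(centers, projections):
    """Linear triangulation of one 3D point from two or more views.

    centers: list of (u, v) detection-box centres in pixels.
    projections: list of 3x4 camera projection matrices, one per view.
    """
    rows = []
    for (u, v), P in zip(centers, projections):
        rows.append(u * P[2] - P[0])  # u * p3^T - p1^T
        rows.append(v * P[2] - P[1])  # v * p3^T - p2^T
    A = np.asarray(rows)
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With noisy detections the result is a least-squares estimate in the algebraic sense, which is why more than two views help.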
III-B Decoupling of Quadric Rotation and Axial Length
The rotation and quadric axial lengths are considered after the quadric centroid translation has been estimated independently. We assume that the ellipsoid of the object, such as an autonomous vehicle or robot, is under the constraint of yaw rotation, while the pitch and roll are constant at zero. This is satisfied for autonomous vehicles on the road in outdoor scenes. Therefore, we can replace the rotation matrix in Eq.1 by:
where, , , are elements of the quadric centroid translation vector.
We can simplify the linear form in  by using the landmark BBox observations and the corresponding dual quadric planes by substituting the .
The decoupled linear form of Eq.5
can be solved by singular value decomposition (SVD), where is the remaining elements of the dual quadric to be estimated.
Finally, the 9-D vector of the quadric with orientation, translation and axial lengths of the ellipsoid can be obtained by the estimated dual quadric matrix :
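One way to recover translation, rotation and axial lengths from an estimated dual quadric matrix is sketched below. It assumes the common form $Q^{*} = \begin{bmatrix} R D R^{\top} - t t^{\top} & -t \\ -t^{\top} & -1 \end{bmatrix}$ up to scale; the function name `decompose_dual_quadric` is hypothetical, and the paper's own extraction formula is not recoverable from the text.

```python
import numpy as np

def decompose_dual_quadric(Q):
    """Return (axes, R, t) from a 4x4 dual ellipsoid matrix of any nonzero scale.

    axes are the semi-axis lengths in ascending order, and the columns of R
    are the matching principal directions.
    """
    Q = Q / -Q[3, 3]                # normalise so that Q[3, 3] == -1
    t = -Q[:3, 3]                   # the last column is (-t, -1)
    M = Q[:3, :3] + np.outer(t, t)  # equals R D R^T
    eigvals, R = np.linalg.eigh(M)  # ascending eigenvalues, orthonormal columns
    axes = np.sqrt(eigvals)         # D holds the squared semi-axes
    return axes, R, t
```

The eigenvectors are only defined up to sign and ordering, so a yaw angle read from `R` needs a consistent axis convention, and `R` may need a column flip to have determinant +1.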
IV 3D Object Observation Constraints Optimization
In the local mapping thread, we optimize the quadrics by using odometry factors and landmark factors combined with the observation of local keyframes. We define the set of detected objects as , and the set of mapped objects as . By minimizing the observation error between observed instances and associated mapped instance , of the quadric can be optimized with the following constraint:
The Huber kernel is used to enhance robustness against outlier observations, and a nonlinear least-squares optimizer is used to minimize the target cost function.
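As a reminder of how a Huber kernel limits the influence of outliers, a minimal sketch is given below. It uses one common convention (quadratic inside the threshold, linear outside); the threshold `delta` and the scaling are assumptions, and optimization libraries differ in the exact constants.

```python
def huber(residual, delta=1.0):
    """Huber cost: quadratic for |r| <= delta, linear beyond it."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```

A residual of 3 with `delta = 1` costs 2.5 instead of the 4.5 a purely quadratic cost would give, so a single bad detection pulls the ellipsoid estimate less strongly.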
IV-A The 2D detection error
The 2D detection error is used to calculate the distance error between the 2D object BBox and the detected BBox in the keyframe. Detection results near the edge of the image are ignored in order to eliminate the effect of occlusion.
IV-B Prior axial length error
The prior axial length error is calculated by the distance between the prior axial length and the object quadric axial length with the same object class.
IV-C Texture plane error
Similar to the method proposed by [12], the texture plane error is obtained from the minimum distance between the fitted texture plane and the quadric landmark. The plane parameters of the texture plane are obtained by Delaunay triangulation of the object's map points, with the normal vector and plane distance of a texture plane . The texture plane distance error can be calculated as:
V The Object Data Association Algorithm
Multi-view geometry information is used for object landmark initialization, while the object detection results are obtained from single-frame images. Therefore, the detected instances must be correctly associated with the same object in the map. We propose the ODA algorithm to integrate information for data association. The Hungarian algorithm [9] is used to complete the assignment with the minimum distance error. Three different distance metrics are used as affinity functions to obtain , which is the element of the cost matrix . The , and parameters are experimentally set to 0.8, 1, and 0.8 respectively.
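The assignment step can be sketched with SciPy's `linear_sum_assignment`, which solves the same minimum-cost assignment problem as the Hungarian algorithm. How the three distances are combined into one cost is not recoverable from the text, so the weighted sum, the `weights` argument and the gating threshold `max_cost` below are all assumptions made for illustration, as is the function name `associate`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(d_inlier, d_iou, d_size, weights=(1.0, 1.0, 1.0), max_cost=2.0):
    """Match detections (rows) to mapped objects (columns).

    Each d_* is a (num_detections x num_objects) distance matrix.
    Returns a list of (detection_index, object_index) pairs.
    """
    # Assumed combination rule: a weighted sum of the three distances.
    cost = (weights[0] * np.asarray(d_inlier, dtype=float)
            + weights[1] * np.asarray(d_iou, dtype=float)
            + weights[2] * np.asarray(d_size, dtype=float))
    rows, cols = linear_sum_assignment(cost)
    # Pairs above the gate are left unmatched, e.g. to spawn a new object.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```

Unmatched detections returned by the gate are natural candidates for creating new objects, which mirrors the decision the tracking thread has to make.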
V-A Semantic Inliers Points Distance
To overcome the overlap of the detection masks, we use Bi-directional Optical Flow (BODF) to track the keypoints within the detection mask from the last keyframe and obtain the keypoints set . We calculate the ratio of inliers corresponding to the same object class, where calculates the element numbers of the set:
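The inlier ratio can be illustrated with a small sketch. The exact formula is not recoverable from the text, so the version below is an assumption: it counts how many tracked keypoints of a mapped object land inside a detection mask and turns the ratio into a distance. The function name `inlier_distance` and the representation of the mask as a set of pixel coordinates are hypothetical.

```python
def inlier_distance(tracked_points, mask_pixels):
    """Return 1 - (tracked keypoints inside the mask / all tracked keypoints).

    tracked_points: iterable of (x, y) positions tracked from the last keyframe.
    mask_pixels: set of integer (x, y) pixels belonging to the detection mask.
    """
    points = [(int(round(x)), int(round(y))) for x, y in tracked_points]
    if not points:
        return 1.0  # no evidence: treat as maximally distant
    inliers = sum(1 for p in points if p in mask_pixels)
    return 1.0 - inliers / len(points)
```

For example, if three of four tracked keypoints fall inside the mask, the distance is 0.25.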
V-B Intersection over Union Distance
To calculate the intersection over union (IoU) distance, we use the intersection ratio between the 2D quadric landmark projection BBox of and the 2D detection result of the object instance .
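A minimal sketch of an IoU-based distance between two axis-aligned boxes follows. Defining the distance as one minus the IoU is an assumption, as is the function name `iou_distance`; boxes are given as `(xmin, ymin, xmax, ymax)`.

```python
def iou_distance(box_a, box_b):
    """Return 1 - IoU for two boxes given as (xmin, ymin, xmax, ymax)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    if union <= 0.0:
        return 1.0  # degenerate boxes: treat as non-overlapping
    return 1.0 - inter / union
```

Here `box_a` would be the BBox of the projected quadric landmark and `box_b` the detected BBox.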
V-C Prior Object Size Distance
VI Experiments
The proposed system consists of two modules, the SLAM module and the detection module; the overall system architecture is shown in Fig. 2. To evaluate the performance of the proposed method, we build a simulation environment based on OpenGL and compare its robustness and accuracy against other state-of-the-art techniques. The KITTI Raw Data dataset is adopted as the real-world benchmark to demonstrate the effectiveness of our method in outdoor scenes. All experiments are conducted on an Intel(R) Core(TM) i7-9750H CPU @ 2.6 GHz with 16 GB of memory and an Nvidia GTX 1080 Ti.
We define the following criteria for evaluation:
(1) : The intersection ratio between the ground truth (GT) and the estimated quadric projection detection.
(2) : The quadric centroid translation error between the GT ellipsoid and the estimated ellipsoid, indicating the accuracy of the ellipsoid position estimation.
(3) : The ellipsoid axial length error between the GT ellipsoid and the estimated ellipsoid in the world coordinate frame, indicating the accuracy of the object shape estimation.
VI-A Quantitative Evaluation of Simulation
Simulation provides the GT of object positions and makes it easy to test the robustness of the methods under different types of disturbance. We create the synthetic dataset with OpenGL: five cameras are evenly deployed along circular arcs to simulate camera observations in the outdoor environment. An ellipsoid with varying shape and yaw rotation is deployed, and its GT 2D object BBox and position are provided. The yaw rotation of the ellipsoid is randomly sampled in the range of to simulate objects with rotation. To reduce the influence of random errors on the experimental results, for each type of noise, 10 ellipsoids are generated with Gaussian noise from 10 seeds, resulting in a total of 100 trials.
To test the effect of different types of noise on the quadric initialization method, the relative camera poses are perturbed with zero-mean Gaussian noise over a range of standard deviations to simulate trajectory error. In addition, the detection BBox is simulated by adding zero-mean Gaussian noise to the GT BBox.
We compare the following quadric initialization methods: (a) Nicholson et al. [11], denoted as Q-SLAM; (b) Rubino et al. [15], denoted as Conic-method; (c) the proposed method with only the decoupling of the quadric central translation, denoted as Tri; and (d) the full proposed initialization method, denoted as Tri+Yaw.
Quantitative evaluation results of the initialization methods under different types of noise are visualized in Fig. 3. The plots show how each evaluation criterion changes as the noise increases. As can be seen from Fig. 3, with zero noise the results of all methods are consistent with the GT, which confirms that all methods are correct in the noise-free case. The performance of all methods degrades as the noise increases.
The Q-SLAM method is clearly the most sensitive to noise among all the techniques. When the translation noise reaches 15%, the rotation noise reaches 20%, or the detection BBox noise reaches 2%, Q-SLAM fails to construct ellipsoids.
The Conic-method maintains relatively good results under translation and rotation noise, which shows its robustness to these disturbances. Under detection BBox noise, however, the and of the Conic-method also increase rapidly. When the detection BBox error exceeds 4%, the Conic-method fails to initialize the ellipsoids, which indicates that it is also sensitive to detection noise. In contrast, the performance of the proposed method is stable: under translation and rotation noise, the error remains bounded, with a maximum axial error of 0.45 m and a maximum translation error of 0.89 m. The detection BBox error also affects these metrics only within a small range, with a maximum axial error of 1.02 m and a maximum translation error of 2.10 m. These results show that the proposed method significantly improves the robustness of initialization and exhibits the smallest error growth as the noise increases. The visualization results of quadric initialization are shown in Fig. 4, where the red ellipsoid is the GT and the green ellipsoid is the estimate. Our proposed method outperforms all compared methods.
VI-B Evaluation on KITTI Raw Data Dataset
To evaluate the performance of the proposed method in outdoor environments, we select the KITTI Raw Data dataset [6], in particular the sequences -09, -22, -23, -36, -59, and -93, which were recorded in urban and residential areas with vehicles. The dataset provides GT for vehicles, including the 3-DoF object size and the 6-DoF object pose. With the extrinsic parameters of the sensors, we can transform the object pose to camera coordinates.
Table I shows the success rate of initialization and ellipsoid construction by different methods using different sequences. Tables II, III and IV show the experimental results of successfully constructed ellipsoids under different evaluation criteria.
From Table I, we can see that our method constructs ellipsoids for 60.23% of the vehicles, a relative increase in success rate of 62.6% (from 37.0% to 60.23%) over the Conic-method and of 99.2% (from 30.24% to 60.23%) over Q-SLAM, which confirms the effectiveness of our initialization method. For the metric, larger values indicate better construction results. As can be seen from Table II, our method outperforms the other existing methods with respect to in sequences -09 and -36, with the best overall average of 73.03%. The compared methods give better results for individual sequences because they discard some detection results that fail to be initialized. For and , smaller values indicate better construction results. As can be seen from Table III and Table IV, our method outperforms the compared methods in all cases except for sequence -22, with an average ellipsoid central translation error of 2.127 m, a reduction in error of nearly 52.2%. In addition, our average axial length error is 0.642 m, a 50.8% reduction in error, compared with 1.369 m and 0.947 m for the other techniques. These experimental results show the robustness and effectiveness of the proposed method for ellipsoid representations in outdoor scenes.
Finally, we show the constructed object maps in Fig.1. The yellow ellipsoids in the map represent static vehicles and the yellow quadrics illustrate the orientation and shape of the estimated ellipsoids when projected onto the image frame. The magenta lines show the center of the ellipsoids in previous frames projected onto the current image frame, demonstrating the accuracy of the ODA algorithm. The red BBox represents the vehicles that are detected as dynamic objects and are not contained in the map.
In this work, a novel pipeline of real-time object-oriented stereo visual SLAM with 3D quadric landmarks is presented. A quadric initialization method based on the DQP algorithm is proposed to improve the robustness and success rate of ellipsoid construction. The data association is solved by the ODA algorithm which ensures highly accurate object pose estimation. Extensive experiments are conducted to show that the proposed system is accurate and robust to observation noise and significantly outperforms other methods in an outdoor environment.
In future work, we will explore the semantic relationships between object ellipsoids and use the semantic information of the object map for localization and re-localization.
[1] (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP).
[2] YOLACT: real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9157–9166.
[3] ORB-SLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM.
[4] (1998) Quadric reconstruction from dual-space geometry. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 25–31.
[5] (2020) Perspective-2-ellipsoid: bridging the gap between object detections and 6-DoF camera pose. IEEE Robotics and Automation Letters 5 (4), pp. 5189–5196.
[6] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
[7] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[8] (2019) Real-time monocular object-model aware sparse SLAM. In 2019 International Conference on Robotics and Automation (ICRA), pp. 7123–7129.
[9] (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97.
[10] (2020) Object-oriented SLAM using quadrics and symmetry properties for indoor environments.
[11] (2018) QuadricSLAM: dual quadrics from object detections as landmarks in object-oriented SLAM. IEEE Robotics and Automation Letters 4 (1), pp. 1–8.
[12] (2019) Robust object-based SLAM for high-speed autonomous navigation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 669–675.
[13] (2021) Semantic SLAM with autonomous object-level data association. In 2021 IEEE International Conference on Robotics and Automation (ICRA).
[14] (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
[15] (2017) 3D object localisation from multi-view image detections. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1281–1294.
[16] SLAM++: simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1352–1359.
[17] (2021) Accurate and robust scale recovery for monocular visual odometry based on plane geometry. In 2021 IEEE International Conference on Robotics and Automation (ICRA).
[18] (2020) EAO-SLAM: monocular semi-dense object SLAM based on ensemble data association. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4966–4973.
[19] (2018) Monocular object and plane SLAM in structured environments.
[20] (2019) CubeSLAM: monocular 3-D object SLAM. IEEE Transactions on Robotics 35 (4), pp. 925–938.