Track Advancement of SLAM 跟踪SLAM前沿动态【IROS 2019 SLAM updated】
Simultaneous Localization And Mapping (SLAM) is a fundamental problem in mobile robotics. While sparse point-based SLAM methods provide accurate camera localization, the generated maps lack semantic information. On the other hand, state-of-the-art object detection methods provide rich information about entities present in the scene from a single image. This work incorporates a real-time deep-learned object detector into the monocular SLAM framework, representing generic objects as quadrics that permit detections to be seamlessly integrated while preserving real-time performance. A finer reconstruction of an object, learned by a CNN, is also incorporated and provides a shape prior for the quadric, leading to further refinement. To capture the dominant structure of the scene, additional planar landmarks are detected by a CNN-based plane detector and modelled as landmarks in the map. Experiments show that the introduced plane and object landmarks and the associated constraints, using the proposed monocular plane detector and incorporated object detector, significantly improve camera localization and lead to a richer, semantically more meaningful map. The performance of our SLAM system is demonstrated in https://youtu.be/UMWXd4sHONw.
Simultaneous Localization And Mapping (SLAM) is one of the fundamental problems in mobile robotics that aims to reconstruct a previously unseen environment while localizing a mobile robot with respect to it. The representation of the map is an important design choice as it directly affects its usability and precision. A sparse and efficient representation for visual SLAM is to consider the map as a collection of points in 3D, which carries information about the geometry but not about the semantics of the scene. Denser representations [2, 3, 4, 5, 6] remain equivalent to a collection of points in this regard.
Man-made environments contain many objects that can be used as landmarks in a SLAM map, encapsulating a higher level of abstraction than a set of points. Previous object-based SLAM efforts have mostly relied on a database of predefined objects, which must be recognized and have a precise 3D model fitted to the observation in the image to establish correspondence. Other work has admitted more general objects (and constraints), but only in a slow, offline structure-from-motion context. In contrast, we are concerned with online (real-time) SLAM, but we seek to represent a wide variety of objects. We are likewise not concerned with high-fidelity reconstruction of individual objects, but rather with representing the location, orientation and rough shape of objects, while incorporating fine point-cloud reconstructions on demand. A suitable representation is therefore a quadric, which captures rough extent and pose compactly while allowing elegant data association. In addition to objects, much of the large-scale structure of a general scene (especially indoors) comprises dominant planar surfaces. Planes provide information complementary to points by representing significant portions of the environment with few parameters, leading to a representation that can be constructed and updated online. In addition to constraining points that lie on them, planes permit the introduction of useful affordance constraints between objects and their supporting surfaces, which leads to a better estimate of the camera pose.
This work aims to construct a sparse semantic map representation consisting not only of points, but also of planes and objects as landmarks, all of which are used to localize the camera. We explicitly target real-time performance in a monocular setting, which would be impossible with uncritical choices of representation and constraints. To that end, we use the representation for dual quadrics proposed in our previous work to represent and update general objects. However, that work has fundamental limitations in two respects. (1) From the front-end perspective: a) reliance on the depth channel for plane segmentation and parameter regression, b) pre-computation of Faster R-CNN based object detections to permit real-time performance, and c) ad-hoc object and plane matching/tracking. (2) From the back-end perspective: a) conic observations are assumed to be axis-aligned, limiting the robustness of the quadric reconstruction, and b) all detected landmarks are maintained in a single global reference frame. In addition to addressing these limitations, this work proposes new factors amenable to real-time inclusion of plane and object detections, while incorporating fine point-cloud reconstructions from a deep-learned CNN into the map, wherever available, and refining the quadric reconstruction according to this object model.
The main contributions of the paper are as follows: (1) integrating two different CNN-based modules to segment planes and regress their parameters, (2) integrating a real-time deep-learned object detector in a monocular SLAM framework to detect general objects as landmarks, along with a data-association strategy to track them, (3) proposing a new observation factor for objects that avoids axis-aligned conics, (4) representing landmarks relative to the camera where they are first observed instead of in a global reference frame, and (5) wherever available, integrating the point-cloud model of a detected object, reconstructed from a single image by a CNN, into the map and imposing an additional prior on the extent of the reconstructed quadric based on that point cloud.
SLAM is a well-studied problem in mobile robotics and many different solutions have been proposed for solving it. The most recent of these is the graph-based approach that formulates SLAM as a nonlinear least squares problem. SLAM with cameras has also seen advances in theory and good implementations that have led to many real-time systems, from sparse to semi-dense to fully dense.
Recently, there has been a lot of interest in extending the capability of a point-based representation, either by applying the same techniques to other geometric primitives or by fusing points with lines or planes to get better accuracy. In that regard, a representation for modelling infinite planes has been proposed, and Convolutional Neural Networks (CNNs) have been used to generate plane hypotheses from monocular images, which are refined over time using both image planes and points. A method to fuse points and planes from an RGB-D sensor has also been proposed. The latter works fuse the information of planar entities to increase the accuracy of depth inference.
Quadric-based representations were first proposed in earlier work and later used in a structure-from-motion setup. One approach reconstructs quadrics based on bounding box detections; however, the quadrics are not explicitly modelled to remain bounded ellipsoids. Another presented a semantic mapping system using object detection coupled with RGB-D SLAM; however, the object models do not inform localization. An object-based SLAM system that uses pre-scanned object models as landmarks has also been presented, but it cannot be generalized to unseen objects. Finally, a system has been presented that fuses multiple semantic predictions with a dense map reconstruction: SLAM is used as the backbone to establish multiple-view correspondences for the fusion of semantic labels, but the semantic labels do not inform localization.
For the sake of completeness, this section presents an overview of the representations and factors proposed originally in our previous work. The SLAM problem can be represented as a graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ represents the set of vertices (variables) that need to be estimated and $\mathcal{E}$ represents the set of edges or factors (constraints) between the vertices. The solution of this problem is the optimum configuration of vertices, $\mathcal{V}^*$, that minimizes the overall error over the factors in the graph.
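A quadric surface in 3D space can be represented by a homogeneous quadratic form defined on the 3D projective space $\mathbb{P}^3$ that satisfies $\mathbf{x}^\top \mathbf{Q} \mathbf{x} = 0$, where $\mathbf{x} \in \mathbb{R}^4$ is the homogeneous 3D point and $\mathbf{Q} \in \mathbb{R}^{4 \times 4}$ is the symmetric matrix representing the quadric surface. However, the relationship between a point-quadric and its projection into a camera (a conic) is not straightforward. A widely accepted alternative is to make use of the dual space ([18, 9, 19]), which represents a dual quadric $\mathbf{Q}^*$ by the envelope of planes $\boldsymbol{\pi}$ tangent to it, viz. $\boldsymbol{\pi}^\top \mathbf{Q}^* \boldsymbol{\pi} = 0$, which simplifies the relationship between the quadric and its projection to a conic. A dual quadric can be decomposed as $\mathbf{Q}^* = \mathbf{T} \mathbf{Q}^*_c \mathbf{T}^\top$, where $\mathbf{T} \in SE(3)$ transforms an axis-aligned (canonical) quadric at the origin, $\mathbf{Q}^*_c$, to a desired pose. Quadric landmarks need to remain bounded, i.e. ellipsoids, which requires $\mathbf{Q}^*$ to have 3 positive and 1 negative eigenvalues. In our previous work we proposed a decomposition and incremental update rule for dual quadrics that guarantees this condition and provides a good approximation for incremental update. More specifically, the dual ellipsoid $\mathbf{Q}^*$ is represented as a tuple $(\mathbf{T}, \mathbf{L})$, where $\mathbf{T} \in SE(3)$ and $\mathbf{L}$ lives in $D(3)$, the space of real diagonal $3 \times 3$ matrices, i.e. an axis-aligned ellipsoid accompanied by a rigid transformation. The proposed approximate update rule for $\mathbf{Q}^* = (\mathbf{T}, \mathbf{L})$ is:

$$\mathbf{Q}^* \oplus \Delta\mathbf{Q}^* = (\mathbf{T}, \mathbf{L}) \oplus (\Delta\mathbf{T}, \Delta\mathbf{L}) = (\mathbf{T} \cdot \Delta\mathbf{T},\; \mathbf{L} + \Delta\mathbf{L})$$

where $\oplus$ is the mapping for updating ellipsoids, $\Delta\mathbf{L}$ is the update for $\mathbf{L}$ and $\Delta\mathbf{T}$ is the update for $\mathbf{T}$, carried out in the corresponding Lie algebras of $D(3)$ (isomorphic to $\mathbb{R}^3$) and $SE(3)$, respectively.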
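The decomposition and update described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`dual_quadric`, `update`, `plane_form`) are hypothetical, the diagonal shape entries are assumed to hold semi-axis lengths that are squared when forming the canonical quadric (so the result stays an ellipsoid for any real entries), and the pose increment is passed as an already-exponentiated 4x4 matrix rather than a Lie-algebra vector.

```python
def matmul(A, B):
    """Plain matrix product for small dense matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def translation(tx, ty, tz):
    """Rigid transform with identity rotation (enough for this illustration)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def dual_quadric(T, L):
    """Q* = T Qc* T^T with Qc* = diag(a^2, b^2, c^2, -1).

    L = [a, b, c] holds the assumed semi-axis lengths; squaring them keeps
    three positive and one negative eigenvalue, i.e. a bounded ellipsoid.
    """
    a, b, c = L
    Qc = [[a * a, 0.0, 0.0, 0.0],
          [0.0, b * b, 0.0, 0.0],
          [0.0, 0.0, c * c, 0.0],
          [0.0, 0.0, 0.0, -1.0]]
    return matmul(matmul(T, Qc), transpose(T))


def update(T, L, dT, dL):
    """(T, L) (+) (dT, dL) = (T * dT, L + dL): pose composed, shape added."""
    return matmul(T, dT), [l + d for l, d in zip(L, dL)]


def plane_form(pi, Q):
    """pi^T Q* pi, which is zero exactly when the plane pi is tangent."""
    return sum(pi[i] * Q[i][j] * pi[j] for i in range(4) for j in range(4))
```

For an ellipsoid centred at x = 1 with a semi-axis of 2 along x, the plane x = 3 (written as `[1, 0, 0, -3]`) touches it, so `plane_form` evaluates to zero for that plane.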
In addition to the classic point-camera constraint formed by the observation of a 3D point as a 2D feature point in the camera, we model constraints between higher-level landmarks and their observations in the camera. These constraints also carry semantic information about the structure of the scene, such as the Manhattan assumption and affordances. We present a brief overview of these constraints here. In the next sections we present the newly introduced factors for plane and object observations and for object shape priors induced by the single-view point-cloud reconstructions.
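For a point $\mathbf{x}$ to lie on its associated plane $\boldsymbol{\pi}$ with the unit normal vector $\mathbf{n}$, we introduce the following factor between them:

$$f_d(\mathbf{x}, \boldsymbol{\pi}) = \| \mathbf{n}^\top (\mathbf{x} - \mathbf{x}_o) \|_{\sigma_d}$$

which measures the orthogonal distance between the point and the plane, for an arbitrary point $\mathbf{x}_o$ in the plane. $\|\mathbf{e}\|_{\Sigma}$ is the Mahalanobis norm of $\mathbf{e}$ and is defined as $\mathbf{e}^\top \Sigma^{-1} \mathbf{e}$, where $\Sigma$ is the associated covariance matrix.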
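The Manhattan world assumption, where planes are mostly mutually parallel or perpendicular, is modelled as:

$$f_{\parallel}(\boldsymbol{\pi}_1, \boldsymbol{\pi}_2) = \| \, |\mathbf{n}_1^\top \mathbf{n}_2| - 1 \, \|_{\sigma_{\parallel}} \quad \text{for parallel planes},$$

$$f_{\perp}(\boldsymbol{\pi}_1, \boldsymbol{\pi}_2) = \| \mathbf{n}_1^\top \mathbf{n}_2 \|_{\sigma_{\perp}} \quad \text{for perpendicular planes},$$

where planes $\boldsymbol{\pi}_1$ and $\boldsymbol{\pi}_2$ have unit normal vectors $\mathbf{n}_1$ and $\mathbf{n}_2$.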
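As a rough illustration of the point-plane and Manhattan constraints described above, the residuals can be sketched as follows. The function names and the scalar standard deviation `sigma` are assumptions made for this example; only the geometric quantities (orthogonal distance, dot products of unit normals) come from the text.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def point_plane_residual(x, n, x_o):
    """Signed orthogonal distance of point x from the plane with unit
    normal n passing through the point x_o."""
    return dot(n, [xi - oi for xi, oi in zip(x, x_o)])


def parallel_residual(n1, n2):
    """Zero when the two unit normals are parallel or anti-parallel."""
    return abs(dot(n1, n2)) - 1.0


def perpendicular_residual(n1, n2):
    """Zero when the two unit normals are orthogonal."""
    return dot(n1, n2)


def mahalanobis(e, sigma):
    """e^T Sigma^-1 e for a scalar residual e with variance sigma**2."""
    return (e / sigma) ** 2


# A point 2 m above the plane z = 1, weighted with an assumed 0.5 m
# standard deviation.
floor_normal, point_on_floor = (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)
r = point_plane_residual((1.0, 2.0, 3.0), floor_normal, point_on_floor)
cost = mahalanobis(r, 0.5)
```

In a factor graph each of these residuals would be attached to the corresponding pair of landmark vertices and weighted by its own covariance.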
Under normal conditions, the planar structure of the scene affords stable support for common objects: for instance, floors and tables support indoor objects, and roads support outdoor objects like cars. To impose a supporting affordance relationship between planar entities of the scene and common objects, we introduce a factor between dual quadric object $\mathbf{Q}^*$ and plane $\boldsymbol{\pi}$ as:

$$f_t(\boldsymbol{\pi}, \mathbf{Q}^*) = \| \boldsymbol{\pi}^\top \mathbf{Q}^* \boldsymbol{\pi} \|_{\sigma_t}$$

which models the tangency relationship between them. Note that this tangency constraint is a direct consequence of choosing the dual space for the quadric representation; it would not have been straightforward in the space of point quadrics.
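The supporting relationship can be checked numerically with a small sketch. It assumes an upright (axis-aligned) ellipsoid so that the dual quadric has a simple closed form, and it omits any normalisation of the plane or quadric; the function names are hypothetical.

```python
def upright_dual_quadric(center, semi_axes):
    """Dual quadric of an axis-aligned ellipsoid.

    Equals T diag(a^2, b^2, c^2, -1) T^T for a pure translation T, which
    expands to an upper-left block diag(a^2, b^2, c^2) - t t^T, a last
    column -t, and a bottom-right entry of -1.
    """
    t = list(center)
    sq = [s * s for s in semi_axes]
    Q = [[(sq[i] if i == j else 0.0) - t[i] * t[j] for j in range(3)] + [-t[i]]
         for i in range(3)]
    Q.append([-t[0], -t[1], -t[2], -1.0])
    return Q


def tangency_residual(plane, Q):
    """pi^T Q* pi: zero if the plane is tangent to the ellipsoid."""
    return sum(plane[i] * Q[i][j] * plane[j] for i in range(4) for j in range(4))


table = (0.0, 0.0, 1.0, 0.0)  # the plane z = 0 as (nx, ny, nz, d)

# An object of height 0.4 resting on the table versus one hovering above it.
resting = upright_dual_quadric((0.0, 0.0, 0.2), (0.1, 0.1, 0.2))
hovering = upright_dual_quadric((0.0, 0.0, 0.3), (0.1, 0.1, 0.2))
```

With these values the residual vanishes for the resting object and is negative for the hovering one, so minimising it pulls the quadric towards contact with its supporting plane.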
Man-made environments contain planar structures, such as tables, floors, walls, and roads. If modelled correctly, they can provide information about large feature-deprived regions, giving more map coverage. In addition, these landmarks act as regularizers for other landmarks when constraints are introduced between them. The dominant approach for plane detection is to extract planes from RGB-D input, which provides reliable detection and estimation of plane parameters. In a monocular setting, planes need to be detected and their parameters estimated from a single RGB image, which is an ill-posed problem. However, recent breakthroughs enable us to detect and estimate planes. Recently, PlaneNet presented a deeply learned network to predict plane parameters and corresponding segmentation masks. While the planar segmentation masks are highly reliable, the regressed parameters are not accurate enough for small planar regions in indoor scenes (see Section VI). To address this shortcoming, we use a network that predicts depth, surface normals, and semantic segmentation. Depth and surface normals contain complementary information about the orientation and distance of the planes, while semantic segmentation allows reasoning about the identity of the region, such as wall or floor.
We utilize a state-of-the-art joint network to estimate depth, normals, and segmentation for each RGB frame in real time. We exploit the redundancy in the three separate predictions to boost the robustness of the plane detection by generating plane hypotheses in two ways: 1) for each planar region in the semantic segmentation (regions such as floor or wall) we fit 3D planes, using surface normals for the orientation and depth for the distance of the plane, and 2) depth and surface normal predictions are used in a connected component segmentation of the reconstructed point cloud in a parallel thread ([25, 11]). A plane detection is considered valid if the cosine distance between the normal vectors and the difference between the distance values of the two planes from the two estimations are within certain thresholds. The corresponding plane segmentation is taken to be the intersection of the plane masks of the two hypotheses.
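The cross-check between the two plane hypotheses can be sketched as below. The threshold values, the tuple layout `(normal, distance, mask)` and the function names are illustrative assumptions; the source only states that both the cosine distance of the normals and the difference of the plane distances must fall within thresholds, and that the final mask is the intersection of the two masks.

```python
import math


def cosine_distance(n1, n2):
    """1 - cos(angle) between two (not necessarily unit) normal vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norms = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    return 1.0 - dot / norms


def validate_plane(h_seg, h_cc, max_cos_dist=0.05, max_dist_diff=0.1):
    """Accept a plane only if both hypotheses agree.

    Each hypothesis is (normal, distance, mask), where mask is a set of
    (row, col) pixel coordinates. Returns the intersected mask on
    agreement and None otherwise. Thresholds are placeholder values.
    """
    n1, d1, mask1 = h_seg
    n2, d2, mask2 = h_cc
    if cosine_distance(n1, n2) > max_cos_dist:
        return None
    if abs(d1 - d2) > max_dist_diff:
        return None
    return mask1 & mask2


from_segmentation = ((0.0, 0.0, 1.0), 1.00, {(0, 0), (0, 1), (1, 1)})
from_components = ((0.0, 0.01, 1.0), 1.05, {(0, 1), (1, 1), (2, 2)})
wall_like = ((0.0, 1.0, 0.0), 1.00, {(0, 1), (1, 1)})

agreed_mask = validate_plane(from_segmentation, from_components)
rejected = validate_plane(from_segmentation, wall_like)
```

Representing masks as pixel sets keeps the sketch dependency-free; with image arrays the intersection would be an element-wise logical AND.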
Note that the association between inlier 3D point landmarks and planes, used by the point-plane factor described in Section III-C, is extracted from the resulting mask. A 3D point is considered an inlier if its corresponding 2D keypoint lies inside the mask and the point also satisfies a certain geometric distance threshold.
Once initialized and added to the map, the planes need to be associated with the planes detected in incoming frames. Matching planes is more robust than matching feature points due to the inherent geometrical nature of planes. To make data association more robust in cluttered scenes, we additionally use, when available, the detected keypoints that lie inside the segmented plane in the image to match the observations. A plane in the map and a plane in the current frame are deemed a match if the number of common keypoints is higher than a threshold and their unit normal vectors and distances agree within certain thresholds. If the number of common keypoints is less than another threshold (or zero for feature-deprived regions), meaning that there is no corresponding map plane for the detected plane, the observed plane is added to the map as a new landmark. The map can therefore contain two or more planar regions that belong to the same infinite plane, such as two tables of the same height in an office. However, additional constraints on parallel planes are also introduced according to the evidence (Section III-C).
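After successful data association, we can introduce the observation factor between the plane and the camera (key-frame). We use a relative key-frame formulation (instead of the global frame) for each plane landmark: a plane landmark $\boldsymbol{\pi}_r$ is expressed relative to the first key-frame ($\mathbf{T}^w_r$) that observes it. For an observation $\boldsymbol{\pi}_{obs}$ from a camera pose $\mathbf{T}^w_c$, the multi-edge factor (connecting more than two nodes) for measuring the plane observation is given by:

$$f_{\pi}(\boldsymbol{\pi}_r, \mathbf{T}^w_r, \mathbf{T}^w_c) = \| \, d(\mathbf{T}^{c\,-\top}_r \boldsymbol{\pi}_r,\; \boldsymbol{\pi}_{obs}) \, \|_{\Sigma}$$

where $\mathbf{T}^{c\,-\top}_r \boldsymbol{\pi}_r$ is the plane transformed from its reference frame to the camera coordinate frame, with $\mathbf{T}^c_r = (\mathbf{T}^w_c)^{-1} \mathbf{T}^w_r$, $d$ is the geodesic distance between the two planes, and $\mathbf{T}^w_c$ is the pose of the camera, which takes a point in the current camera frame ($\mathbf{x}_c$) to a point in the world frame, $\mathbf{x}_w = \mathbf{T}^w_c \mathbf{x}_c$.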
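The relative key-frame formulation can be made concrete with a short sketch that moves a plane from its reference key-frame into the current camera frame and compares it with an observation. Poses are assumed to be `(R, t)` pairs mapping frame coordinates to world coordinates, planes are `(n, d)` with `n . x + d = 0`, and the residual used here (angle between normals plus distance difference) is a simple stand-in for the geodesic distance used in the paper.

```python
import math


def mat_vec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(R):
    return [list(col) for col in zip(*R)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def plane_in_camera(plane_ref, pose_ref, pose_cam):
    """Express a plane stored in its reference key-frame in the camera frame.

    A point moves from reference to camera coordinates by
    x_c = R_rel x_r + t_rel, with R_rel = R_c^T R_r and
    t_rel = R_c^T (t_r - t_c). The plane then becomes
    n_c = R_rel n_r and d_c = d_r - n_c . t_rel.
    """
    n_r, d_r = plane_ref
    (R_r, t_r), (R_c, t_c) = pose_ref, pose_cam
    R_cT = transpose(R_c)
    t_rel = mat_vec(R_cT, [a - b for a, b in zip(t_r, t_c)])
    n_c = mat_vec(R_cT, mat_vec(R_r, n_r))
    d_c = d_r - dot(n_c, t_rel)
    return n_c, d_c


def plane_residual(predicted, observed):
    """Angle between the normals (radians) and difference of distances."""
    (n_p, d_p), (n_o, d_o) = predicted, observed
    cos_angle = max(-1.0, min(1.0, dot(n_p, n_o)))
    return math.acos(cos_angle), d_p - d_o


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Plane z = 2 stored in a reference key-frame that coincides with the world
# frame, seen from a camera that has moved 1 unit along z.
predicted = plane_in_camera(((0.0, 0.0, 1.0), -2.0),
                            (I3, (0.0, 0.0, 0.0)),
                            (I3, (0.0, 0.0, 1.0)))
```

From the moved camera the same plane is predicted at z = 1 in camera coordinates, and an observation of exactly that plane yields a zero residual.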
As noted earlier, incorporating general objects in the map as quadrics leads to a compact representation of the rough 3D extent and pose (location and orientation) of the object while facilitating elegant data association. A state-of-the-art object detector such as YOLOv3 can provide object labels and bounding boxes in real time for general objects. The goal of introducing objects in SLAM is both to increase the accuracy of the localization and to yield a richer semantic map of the scene. While our SLAM system proposes a sparse and coarse realization of the objects, wherever a fine model reconstruction of an object is available it can be seamlessly incorporated on top of the corresponding quadric and can even refine the quadric reconstruction, as discussed in Section V-B.
For real-time detection of objects, we use YOLOv3 trained on the COCO dataset, which provides detections as axis-aligned bounding boxes for common objects. For reliability, we only consider detections with a confidence of 85% or more.
Relying solely on the geometry of the reconstructed quadrics (by comparing re-projection errors) to track object detections against the map is not robust enough, particularly for a high number of overlapping or partially occluded detections. Therefore, to find the optimum matches for all the detected objects in the current frame, we solve the classic optimum assignment problem with the Hungarian/Munkres algorithm. The challenge in using this classic algorithm is how to define an appropriate cost matrix. We build the cost matrix on the idea of maximizing the number of common, robustly matched keypoints (2D ORB features) inside the detected bounding boxes. Since we want to solve a minimization problem, the cost matrix is defined as:

$$\mathbf{C} = [c_{ij}]_{N \times M}, \qquad c_{ij} = K - p(b_i, q_j)$$

where $p(b_i, q_j)$ gives the number of projected keypoints associated with candidate quadric $q_j$ inside the bounding box $b_i$, and $K = \max_{i,j} p(b_i, q_j)$ is the maximum number of all of these projected keypoints. $N$ and $M$ are the total number of bounding box detections in the current frame and of candidate quadrics of the map for matching, respectively. Candidate quadrics for matching are the quadrics of the map that are currently in front of the camera.
To reduce the number of mismatches even further, after solving the assignment problem with the proposed cost matrix, the solved assignment of bounding box $b_i$ to quadric $q_j$ is considered successful if the number of common keypoints satisfies a certain high threshold, $p(b_i, q_j) \geq th_{high}$, and a new quadric is initialized in the map if $p(b_i, q_j) \leq th_{low}$. Assignments with values between these thresholds are ignored.
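The association step can be sketched as follows. To stay dependency-free, the sketch replaces the Hungarian/Munkres solver with an exhaustive search over permutations, which returns the same minimum-cost assignment but only scales to a handful of detections; in practice a library solver such as `scipy.optimize.linear_sum_assignment` would be the natural choice. The threshold values, the function names, and the decision to treat detections left without a candidate as new objects are assumptions for illustration.

```python
from itertools import permutations


def cost_matrix(counts):
    """c_ij = K - p_ij, where counts[i][j] is the number of keypoints of
    candidate quadric j that project inside bounding box i and K is the
    largest such count. Assumes at least one detection and one candidate."""
    K = max(max(row) for row in counts)
    return [[K - p for p in row] for row in counts]


def solve_assignment(cost):
    """Exhaustive minimum-cost assignment (stand-in for Hungarian/Munkres)."""
    n, m = len(cost), len(cost[0])
    best_total, best_pairs = None, []
    if n <= m:
        for cols in permutations(range(m), n):
            total = sum(cost[i][j] for i, j in enumerate(cols))
            if best_total is None or total < best_total:
                best_total, best_pairs = total, list(enumerate(cols))
    else:
        for rows in permutations(range(n), m):
            total = sum(cost[i][j] for j, i in enumerate(rows))
            if best_total is None or total < best_total:
                best_total = total
                best_pairs = sorted((i, j) for j, i in enumerate(rows))
    return best_pairs


def associate(counts, th_high=10, th_low=2):
    """Split detections into matched pairs, new objects and ignored ones."""
    matched, new, ignored = [], [], []
    assigned = set()
    for i, j in solve_assignment(cost_matrix(counts)):
        assigned.add(i)
        if counts[i][j] >= th_high:
            matched.append((i, j))
        elif counts[i][j] <= th_low:
            new.append(i)
        else:
            ignored.append(i)
    # Assumption: detections with no candidate left also start new quadrics.
    new.extend(i for i in range(len(counts)) if i not in assigned)
    return matched, new, ignored


# Rows are detections in the current frame, columns are candidate quadrics.
counts = [[25, 3, 0],
          [1, 0, 0],
          [4, 6, 5]]
result = associate(counts)
```

Here detection 0 shares 25 keypoints with quadric 0 and is matched, detection 1 shares none with its assigned quadric and starts a new landmark, and detection 2 falls between the thresholds and is ignored.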
In this section, we present a method for estimating a fine geometric model of available objects, established on top of the quadrics to enrich their inherently coarse representation. It is difficult to estimate the full 3D shape of objects from sparse views using purely classical geometric methods. To bypass this limitation, we train a CNN adapted from Point Set Generation Net to predict (or hallucinate) the accurate 3D shape of objects as point clouds from single-view RGB images.
The CNN is trained on the CAD model repository ShapeNet. We render 2D images of CAD models from random viewpoints and, to simulate the background in real images, we overlay random scene backgrounds from the SUN dataset on the rendered images. We demonstrate the efficacy of this approach for outdoor scenes, particularly for general car objects in the KITTI benchmark, in Section VI-B. Running alongside the SLAM system, the CNN takes an amodal detected bounding box of an object as input and generates a point cloud to represent the 3D shape of the object. However, to ease the training of the CNN, the reconstructed point cloud is in a normalized scale and canonical pose. To incorporate the point cloud into the SLAM system, we need to estimate seven parameters to scale, rotate and translate this point cloud. First we compute the minimum enclosing ellipsoid of the normalized point cloud, and then we estimate the parameters by aligning it to the object ellipsoid from SLAM.
After registering the reconstructed point cloud and the quadric from SLAM, we impose a further constraint only on the shape (extent) of the quadric, which is feasible due to the decomposition of the quadric representation. This prior affects the ratio of the major axes of the quadric by computing the intersection over union of the registered, normalized enclosing cuboid of the point cloud, $\tilde{\mathcal{M}}$, and the normalized enclosing cuboid of the quadric:

$$f_{prior}(\mathbf{Q}^*) = \| \, 1 - \mathrm{IoU}_{cu}(\mathcal{M}(\mathbf{Q}^*),\; \tilde{\mathcal{M}}) \, \|_{\sigma_p}$$

where $\mathcal{M}$ is a function that gives the normalized enclosing cuboid of an ellipsoid.
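A minimal sketch of such a shape prior is given below. It assumes that, after registration, both cuboids are axis-aligned and share the same centre, and that "normalized" means scaling so the longest side has length 2; the source does not spell out the normalization, so this choice and the function names are illustrative only.

```python
def normalized_cuboid(semi_axes):
    """Side lengths of the enclosing cuboid, scaled so the longest is 2."""
    longest = max(semi_axes)
    return tuple(2.0 * s / longest for s in semi_axes)


def cuboid_iou(sides_a, sides_b):
    """IoU of two axis-aligned cuboids that share the same centre."""
    inter = 1.0
    vol_a = 1.0
    vol_b = 1.0
    for a, b in zip(sides_a, sides_b):
        inter *= min(a, b)
        vol_a *= a
        vol_b *= b
    return inter / (vol_a + vol_b - inter)


def shape_prior_error(quadric_axes, cloud_axes):
    """1 - IoU between the normalized cuboids of the SLAM quadric and of
    the ellipsoid enclosing the registered point cloud."""
    return 1.0 - cuboid_iou(normalized_cuboid(quadric_axes),
                            normalized_cuboid(cloud_axes))


# Same proportions at a different scale: the prior is satisfied.
same_shape = shape_prior_error((2.0, 1.0, 1.0), (4.0, 2.0, 2.0))
# The quadric is too wide along its second axis relative to the point cloud.
too_wide = shape_prior_error((2.0, 2.0, 1.0), (2.0, 1.0, 1.0))
```

Because both cuboids are normalized, the error depends only on the ratios of the axes, which matches the statement that the prior acts on the shape rather than the scale or pose of the quadric.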
As an expedient approach, we currently pick a single high-quality detected bounding box as the input to the CNN; however, this could be extended to multiple bounding boxes by using a recurrent neural network to fuse information from different bounding boxes, as done in 3D-R2N2.
We propose an observation factor for the quadric that does not require it to be observed as an axis-aligned inscribed conic (ellipse). Unlike previous work, which uses the Mahalanobis distance between the detected and projected bounding boxes (not robust, since it penalizes large errors and outliers more heavily), we use an error function based on the Intersection-over-Union (IoU) of these bounding boxes, weighted according to the confidence score of the object detector. This factor provides an inherently capped error; however, it implicitly emphasizes the importance of a good initialization of the quadrics for a successful optimization. Similar to plane landmarks, we use the relative reference key-frame to represent the coordinates of the objects, and introduce the multi-edge factor for the object observation error between dual quadric and camera pose as:

$$f_Q(\mathbf{Q}^*_r, \mathbf{T}^w_r, \mathbf{T}^w_c) = \| \, 1 - \mathrm{IoU}_{bb}(\mathbf{B}^*, \mathbf{B}_{obs}) \, \|_{\sigma_{obs}}$$

where $\mathbf{B}_{obs}$ is the detected bounding box and $\mathbf{B}^*$ is the enclosing bounding box of the projected conic $\mathbf{C}^* \sim \mathbf{P} \mathbf{Q}^*_r \mathbf{P}^\top$, with $\mathbf{P} = \mathbf{K} [\mathbf{I}_{3 \times 3} \;\; \mathbf{0}_{3 \times 1}] \mathbf{T}^c_r$ the projection matrix of the camera with calibration matrix $\mathbf{K}$, and $\mathbf{T}^c_r = (\mathbf{T}^w_c)^{-1} \mathbf{T}^w_r$ is the relative pose of the camera from the reference key-frame of the quadric.
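The two ingredients of this factor, the bounding box enclosing a projected dual conic and the IoU-based error, can be sketched as below. The projection of the quadric itself is omitted: the example starts from a 3x3 dual conic. Multiplying the error by the detector confidence is one possible reading of "weighted according to the confidence score" and is an assumption, as are the function names.

```python
import math


def conic_bbox(C):
    """Axis-aligned box enclosing the ellipse with 3x3 dual conic C.

    A vertical line x = u is l = (1, 0, -u); tangency l^T C l = 0 gives
    u = (C13 +/- sqrt(C13^2 - C11*C33)) / C33, and likewise for y.
    The ellipse itself may be rotated; only its extent is used.
    """
    dx = math.sqrt(C[0][2] ** 2 - C[0][0] * C[2][2])
    dy = math.sqrt(C[1][2] ** 2 - C[1][1] * C[2][2])
    xs = sorted(((C[0][2] - dx) / C[2][2], (C[0][2] + dx) / C[2][2]))
    ys = sorted(((C[1][2] - dy) / C[2][2], (C[1][2] + dy) / C[2][2]))
    return xs[0], ys[0], xs[1], ys[1]


def bbox_iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def observation_error(detected, projected, confidence):
    """Confidence-weighted 1 - IoU; bounded by the confidence itself."""
    return confidence * (1.0 - bbox_iou(detected, projected))


# Dual conic of an ellipse centred at (100, 50) with semi-axes 20 and 10.
C = [[20 ** 2 - 100 ** 2, -100 * 50, -100],
     [-100 * 50, 10 ** 2 - 50 ** 2, -50],
     [-100, -50, -1]]
projected_box = conic_bbox(C)
error = observation_error((90.0, 40.0, 130.0, 60.0), projected_box, 0.9)
```

Since the IoU lies between 0 and 1, the error can never exceed the weight, which is the capped behaviour mentioned in the text; the flip side is that boxes with no overlap give a constant error with no useful gradient, hence the importance of a good initialization.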
The proposed system is built on top of the state-of-the-art ORB-SLAM2 and utilizes its front-end for tracking ORB features, while the back-end of the proposed system is implemented in C++ using g2o. Performance evaluation is carried out on the publicly available TUM, NYUv2, and KITTI datasets, which range from rich planar low-texture scenes to multi-object offices and outdoor scenes. Qualitative and quantitative evaluations are carried out using different mixtures of landmarks, and comparisons are presented against point-based monocular ORB-SLAM2.
Qualitative evaluation on TUM and NYUv2 for the sequences fr2/desk, nyu/office_1b, and nyu/nyu_office_1 is illustrated in Fig. 2 for different scenes and landmarks. Columns (a)-(d) show the image frame with tracked features and possible detected objects, the detected and segmented planes, and the reconstructed map from two different viewpoints, respectively. For some sequences with low or no texture in the TUM and NYUv2 datasets, point-based SLAM systems fail to track the camera; however, our system exploits the rich planar structure that is present, along with the Manhattan constraints, to yield more accurate trajectories and semantically meaningful maps.
The reconstructed maps are semantically rich and consistent with the ground-truth 3D scene. For instance, in fr2/desk, with all landmarks and constraints present, the map consists of a planar monitor orthogonal to the desk, and the quadrics corresponding to objects are tangent to the supporting desk, congruous with the real scene. Red ellipses in Fig. 2 column (a) are the projections of their corresponding quadric objects in the map. Further evaluations can be found in the supplemental video.
One of the main reasons for the improved accuracy of the camera trajectory and the consistency of the global map is that the system addresses the subtle but important problem of scale drift. In a monocular setting, the estimated scale of the map can change gradually over time. In our system, the consistent metric scale of the plane detections and the presence of point-plane constraints allow the absolute scale to be observed, which can be further improved by adding priors about the extent of the objects represented as quadrics.
One of the important factors that can affect the system performance is the quality of the estimated plane parameters. Reconstructed maps are shown in Fig. 3 for two different monocular plane detectors incorporated in our system: a) PlaneNet, and b) our proposed plane detector (see Section IV). The baseline for comparison is a depth-based plane detector that uses connected component segmentation of the point cloud ([25, 11]). The detected planes are then used in the monocular system for refinement. As seen in Fig. 3(a), PlaneNet only captures the planar table region successfully and fails for the other regions. The proposed detector captures the monitors on the table, shown in column (b); however, it misses the monitor behind and also reconstructs the two tables of the same height with a slight vertical offset. As shown in Fig. 3(c), the baseline plane detector captures the smaller planar regions more accurately and reconstructs the same-height tables as one plane, as expected given its use of additional depth information. Table II reports the comparison of these three approaches for plane detection on different sequences of the TUM dataset. The depth-based detector is the most informative; however, the proposed method is better than PlaneNet in most cases.
We perform an ablation study to demonstrate the efficacy of introducing various combinations of the proposed semantic landmarks and constraints. The RMSE of the Absolute Trajectory Error (ATE) is reported in Table I. Estimated trajectories and ground truth are aligned using a similarity transformation. In the first case, points are augmented with planes (PP) and the constraint between points and their corresponding planes is included. This already improves the accuracy over the baseline, and imposing the additional Manhattan constraint in the second case (PP+M) improves the ATE even further. In these two cases the error is significantly reduced, first by exploiting the structure of the scene and second by reducing the scale-drift problem, as discussed earlier, using metric information about the planes.
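For reference, the reported metric can be computed as in the sketch below. It assumes the two trajectories have already been time-associated and aligned (the similarity alignment itself is not shown), and the helper names are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square of the per-pose position errors.

    Both inputs are equally long lists of (x, y, z) positions that are
    already associated and aligned with each other.
    """
    squared = [sum((e - g) ** 2 for e, g in zip(est, gt))
               for est, gt in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, system_rmse):
    """Percentage reduction of the error with respect to the baseline."""
    return 100.0 * (baseline_rmse - system_rmse) / baseline_rmse


estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
rmse = ate_rmse(estimated, ground_truth)
```

The second helper expresses how a percentage decline in ATE relative to a baseline would be derived from two such RMSE values.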
For the sequences containing common COCO objects, the presence of objects represented by quadric landmarks along with points is explored in the third case (PO). This case demonstrates the effectiveness of integrating objects in the SLAM map. Finally, the performance of our full monocular system (PPO+MS) is detailed in the rightmost column of Table I, with all landmarks (points, planes, and objects) present together with the Manhattan and supporting/tangency constraints. This case shows an improvement over the baseline in all of the evaluated sequences. In particular, for fr3/long_office we see a significant decline in ATE (18.47%) as a result of the large loop in this sequence, where our proposed multi-edge factors for observations of planes and quadric objects in key-frames show their effectiveness in the global loop closure.
To demonstrate the efficacy of our proposed object detection factor, object tracking, and shape prior factor induced from the incorporated point cloud (reconstructed by a CNN from a single view), we evaluate our system on the KITTI benchmark. For reliable frame-to-frame tracking, we use the stereo variant of ORB-SLAM2; however, object detection and plane estimation are still carried out in a monocular fashion. The reconstructed map with quadric objects and incorporated point clouds (see Section V-B) is illustrated for KITTI-7 in Fig. 4. The instances of different cars are rendered in different colors.
| Dataset | PlaneNet | Proposed Detector | Baseline |
This work introduced a monocular SLAM system that can incorporate learned priors in terms of plane and object models in an online real-time capable system. We show that introducing these quantities in a SLAM framework allows for more accurate camera tracking and a richer map representation without huge computational cost. This work also makes a case for using deep-learning to improve the performance of traditional SLAM techniques by introducing higher level learned structural entities and priors in terms of planes and objects.
European Conference on Computer Vision. Springer, 2014, pp. 834–849.
2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, 2013, pp. 1352–1359. [Online]. Available: https://doi.org/10.1109/CVPR.2013.178
J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2010, pp. 3485–3492.