In this paper, we propose a novel dense surfel mapping system that scales well in different environments with only CPU computation. Using a sparse SLAM system to estimate camera poses, the proposed mapping system can fuse intensity images and depth images into a globally consistent model. The system is carefully designed so that it can build from room-scale environments to urban-scale environments using depth images from RGB-D cameras, stereo cameras or even a monocular camera. First, superpixels extracted from both intensity and depth images are used to model surfels in the system. Superpixel-based surfels make our method both run-time efficient and memory efficient. Second, surfels are further organized according to the pose graph of the SLAM system to achieve O(1) fusion time regardless of the scale of reconstructed models. Third, a fast map deformation using the optimized pose graph enables the map to achieve global consistency in real-time. The proposed surfel mapping system is compared with other state-of-the-art methods on synthetic datasets. The performances of urban-scale and room-scale reconstruction are demonstrated using the KITTI dataset and autonomous aggressive flights, respectively. The code is available for the benefit of the community.
This is the open-source version of ICRA 2019 submission "Real-time Scalable Dense Surfel Mapping". Full code will be released when the paper is accepted.
Estimating the surrounding 3D environment is one of the fundamental abilities for robots to navigate safely or perform high-level tasks. To be usable in mobile robot applications, the mapping system needs to fulfill the following four requirements. First, the 3D reconstruction has to densely cover the environment in order to provide sufficient information for navigation. Second, the mapping system should have good scalability and efficiency so that it can be deployed in different environments using limited onboard computation resources. From room-scale (several meters) to urban-scale (several kilometers) environments, the mapping system should maintain both run-time efficiency and memory efficiency. Third, global consistency is required in the mapping systems. If loops are detected, the system should be able to deform the map in real-time to maintain consistency between different visits. Fourth, to be usable in different robot applications, the system should be able to fuse depth maps of different qualities from RGB-D cameras, stereo cameras or even monocular cameras.
In recent years, many methods have been proposed to reconstruct the environment using RGB-D cameras focusing on several requirements mentioned above. KinectFusion  is a pioneering work that uses the truncated signed distance field (TSDF)  to represent 3D environments. Many following works improve the scalability (e.g. Kintinuous ), the efficiency (e.g. CHISEL ), and the global consistency (e.g. BundleFusion ) of TSDF-based methods. Surfel-based methods model the environment as a collection of surfels. For example, ElasticFusion  uses surfels to reconstruct the scene and achieves global consistency. Although all these methods achieve impressive results using RGB-D cameras, extending them to fulfill all four requirements and to be usable in different robot applications is non-trivial.
In this paper, we propose a mapping method that fulfills all four requirements and can be applied to a range of mobile robotic systems. Our system uses state-of-the-art sparse visual SLAM systems to track camera poses and fuses intensity images and depth images into a globally consistent model. Unlike ElasticFusion, which treats each pixel as a surfel, we use superpixels to represent surfels. Pixels are clustered into superpixels if they share similar intensity, depth, and spatial locations. Modeling superpixels as surfels greatly reduces the memory requirement of our system and enables the system to fuse noisy depth maps from stereo cameras or a monocular camera. Surfels are organized according to the keyframes they are last observed in. Using the pose graph of the SLAM systems, we further find locally consistent keyframes and surfels whose relative drift is negligible. Only locally consistent surfels are fused with input images, achieving O(1) fusion time and local accuracy. Global consistency is achieved by deforming surfels according to the optimized pose graph. Thanks to the careful design, our system can be used to reconstruct globally consistent urban-scale environments in real-time without GPU acceleration.
In summary, the main contributions of our mapping method are the following.
We use superpixels extracted from both intensity and depth images to model surfels in the system. Superpixels enable our method to fuse low-quality depth maps. Run-time efficiency and memory efficiency are also gained by using superpixel-based surfels.
We further organize surfels according to the pose graph of the sparse SLAM systems. Using this organization, locally consistent maps are extracted for fusion, and the fusion time remains O(1) regardless of the reconstruction scale. Fast map deformation is also proposed based on the optimized pose graph so that the system can achieve global consistency in real-time.
We implement the proposed dense mapping system using only CPU computation. We evaluate the method using public datasets and demonstrate its usability using autonomous aggressive flights. To the best of our knowledge, the proposed method is the first online depth fusion approach that achieves global consistency at urban scale using only CPU computation.
Most online dense reconstruction methods take depth maps from RGB-D cameras as input. In this section, we introduce different methods to extend the scalability, global consistency and run-time efficiency of these mapping systems.
Kintinuous extends the scalability of mapping systems by using a cyclical buffer. The TSDF volume is virtually transformed according to the movement of the camera. Voxel hashing proposed by Nießner et al. is another solution to improve scalability. Due to the sparsity of surfaces in space, only valid voxels are stored using hashing functions. DynSLAM reconstructs urban-scale models using hashed voxels and a high-end GPU for acceleration. Surfel-based methods are relatively scalable compared with voxel-based methods because only surfaces are stored in the system. Without explicit data optimization, ElasticFusion can build room-scale environments in detail. Fu et al. further increase the scalability of surfel-based methods by maintaining a local surfel set.
To remove the drift from camera tracking and maintain global consistency, mapping systems should be able to deform the model quickly when loops are detected. Whelan et al. improved Kintinuous with point cloud deformation. A deformation graph is constructed incrementally as the camera moves. When loops are detected, the deformation graph is optimized and applied to the point clouds. Surfel-based methods usually deform the map using similar methods. BundleFusion introduces another solution to achieve global consistency using de-integration and reintegration of RGB-D frames. When the camera poses are updated due to pose graph optimization, RGB-D frames are first de-integrated from the TSDF volume and reintegrated using the updated camera poses. Submaps are used by many TSDF-based methods, such as InfiniTAM, to generate globally consistent results. These methods divide the space into multiple low-drift submaps and merge them into a global model using updated poses.
Different methods have been proposed to accelerate the fusion process. Steinbrücker et al. use an octree as the data structure to represent the environment. Voxblox is designed for planning, where both the TSDF and the Euclidean signed distance field are calculated. It proposes grouped raycasting to speed up the integration, and a novel weighting strategy to deal with the distortion caused by large voxel sizes. FlashFusion uses valid chunk selection to speed up the fusion step and achieves global consistency based on the reintegration method. Most surfel-based mapping systems require GPUs to render index maps for data association. MREMap defines octree-organized voxels as surfels so that it does not need GPUs. However, the reconstructed model of MREMap consists of voxels instead of meshes.
The system architecture is shown in Fig. 2. Our system fuses intensity and depth image pairs into a globally consistent model. We use a state-of-the-art sparse visual SLAM system (e.g. ORB-SLAM2  or VINS-MONO ) as the localization system to track the motion of the camera, detect loop closures, and optimize the pose graph. The keys to our mapping system are (1) superpixel-based surfels, (2) pose graph-based surfel fusion, and (3) fast map deformation. For each intensity and depth image input, the localization system generates camera tracking results and provides an updated pose graph. If the pose graph is optimized, our system first deforms all the surfels in the map database to ensure global consistency. After the deformation, the mapping system initializes surfels based on the extracted superpixels from the intensity and depth images. Then, local surfels are extracted from the map database according to the pose graph and fused with the initialized surfels. Finally, both the fused surfels and newly observed surfels are added back into the map database. Fig. 3 illustrates the pipeline of the system to process two frames when loops are detected.
Surfels are used to represent the environment. Each surfel has the following attributes: a position p, a normal n, an intensity i, a weight w, a radius, an update count t, and the index of its attached keyframe. The update count t is used to detect temporary outliers or dynamic objects, and the attached keyframe index indicates the last keyframe in which the surfel was observed.
Inputs of our system are intensity images, depth images, the ego-motion of the camera, and the pose graph from the SLAM system. The i-th intensity image is I_i and the i-th depth image is D_i. A 3D point p in the camera frame can be projected into the image as a pixel u using the camera projection function u = π(p). A pixel u can be back-projected into the camera frame as a point p = π^{-1}(u, d), where d is the depth of the pixel.
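The projection and back-projection functions above can be sketched with a standard pinhole model in Python (the intrinsic values below are illustrative, not from the paper):

```python
import numpy as np

# Illustrative pinhole intrinsics (fx, fy, cx, cy are made-up values).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

def project(p, K):
    """Project a 3D point p in the camera frame to a pixel u = pi(p)."""
    uv = K @ p
    return uv[:2] / uv[2]

def back_project(u, d, K):
    """Back-project pixel u with depth d to a 3D point p = pi^{-1}(u, d)."""
    ray = np.linalg.inv(K) @ np.array([u[0], u[1], 1.0])
    return ray * d

# Round trip: back-projecting a projected point at its depth recovers it.
p = np.array([0.2, -0.1, 2.0])
p_back = back_project(project(p, K), p[2], K)
```

The round trip π^{-1}(π(p), d) = p holds exactly for any point with positive depth, which is what the fusion step relies on when associating surfels with pixels.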
We use a sparse visual SLAM method as the localization system to track the camera motion and optimize the pose graph when there are loop closures. For each frame, the localization system estimates the camera pose T and gives out the reference keyframe that shares the most features with the current frame. T consists of a rotation matrix R and a translation vector t. Using T, a point p_c in the camera frame can be transformed into the global frame as p_w = R p_c + t. A vector (such as the surfel normal) in the camera frame can be transformed into the global frame as n_w = R n_c. Similarly, p_w and n_w can be transformed back into the camera frame using the inverse transformation T^{-1}.
The pose graph used in our system is an undirected graph similar to the covisibility graph in ORB-SLAM2. Vertices in the graph are the keyframes maintained in the SLAM system, and edges indicate that two keyframes share common features. Since the relative poses of frames are constrained by common features through bundle adjustment in the sparse SLAM systems, we assume keyframes are locally consistent if the minimum number of edges between them is less than a given threshold.
If the pose graph of the localization system is updated, our method deforms all the surfels to keep the global consistency before the surfel initialization and fusion. Unlike previous methods that use a deformation graph embedded in the global map, we deform the surfels so that the relative pose between each surfel and its attached keyframe remains unchanged. Although surfels that are attached to the same keyframe are deformed rigidly, the overall deformation of the map is non-rigid.
For a surfel that is attached to keyframe F, the position and normal of the surfel are transformed using T' T^{-1}, where T and T' are the poses of keyframe F before and after the optimization, respectively. After the deformation, T is replaced by the optimized pose T' for the next deformation.
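The per-keyframe rigid correction can be sketched as follows, assuming 4x4 homogeneous pose matrices (a minimal illustration, not the authors' implementation):

```python
import numpy as np

def deform_surfels(positions, normals, T_old, T_new):
    """Rigidly move one keyframe's surfels by T_new @ inv(T_old), so each
    surfel keeps its pose relative to its keyframe. Across keyframes the
    overall map deformation is non-rigid. positions/normals are (N, 3)
    arrays; T_old/T_new are 4x4 homogeneous poses."""
    T = T_new @ np.linalg.inv(T_old)
    R, t = T[:3, :3], T[:3, 3]
    # Normals are direction vectors: rotate only, no translation.
    return positions @ R.T + t, normals @ R.T

# Example: the optimized pose moves the keyframe 1 m along x.
T_old = np.eye(4)
T_new = np.eye(4)
T_new[:3, 3] = [1.0, 0.0, 0.0]
P, N = deform_surfels(np.array([[0.0, 0.0, 0.0]]),
                      np.array([[0.0, 0.0, 1.0]]), T_old, T_new)
```

Because the correction depends only on the keyframe's old and new poses, the deformation cost is linear in the number of surfels and needs no deformation graph.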
Unlike other surfel-based methods that model per-pixel surfels, we extract surfels based on superpixels extracted from intensity and depth images. Using superpixels greatly reduces the memory burden of our system when applied to large-scale missions. More importantly, outliers and noise from low-quality depth maps can be reduced based on the extracted superpixels. This novel representation enables us to reconstruct the environment using stereo cameras, or even monocular cameras.
Superpixels are extracted by a k-means approach adapted from SLIC. The original SLIC operates on RGB images, and we extend it to segment both intensity and depth images. Pixels are clustered according to their intensity, depth and spatial location by first initializing the cluster centers and then alternating between the assignment step and the update step. A major improvement over SLIC is that our superpixel segmentation operates on images where not all pixels have valid depth measurements.
Each cluster center C = [x, y, d, c, r] is initialized on a regular grid on the image. [x, y] is the average location of the clustered pixels, d is the average depth, c is the average intensity value, and r is the radius of the superpixel, defined as the largest distance from the assigned pixels to [x, y]. [x, y] is initialized as the location of the center pixel, and d and c are initialized as the depth and intensity value of that pixel. For cluster centers that are initialized on pixels with no valid depth estimation, the depth is initialized as NaN.
In the assignment step, the per-cluster scan from SLIC is replaced by a per-pixel update so that invalid depths can be handled while the complexity remains unchanged. We define two distances between a pixel u and a candidate cluster center as

D^2 = ((x - x_u)^2 + (y - y_u)^2) / N_s^2 + (c - c_u)^2 / N_c^2 + (d - d_u)^2 / N_d^2,
D_n^2 = ((x - x_u)^2 + (y - y_u)^2) / N_s^2 + (c - c_u)^2 / N_c^2,

where D and D_n are the distances with and without depth information, respectively. [x_u, y_u], d_u and c_u are the location, depth and intensity of pixel u, respectively, and N_s, N_c and N_d normalize the spatial, color and depth proximity, respectively, before the summation. Each pixel scans the four neighboring candidate cluster centers. If pixel u and all the centers have valid depth values, the assignment is done by comparing D. Otherwise, D_n is used for the assignment.
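A minimal sketch of the two assignment distances follows; the normalization constants N_s, N_c, N_d and the pixel tuple layout are illustrative assumptions, not the paper's values:

```python
# Assumed normalization constants for spatial, color, and depth proximity.
N_s, N_c, N_d = 8.0, 10.0, 0.5

def dist_with_depth(px, center):
    """Squared distance D^2 using spatial, intensity, and depth terms.
    Both px and center are (x, y, intensity, depth) tuples."""
    (x, y, c, d), (cx, cy, cc, cd) = px, center
    return ((x - cx) ** 2 + (y - cy) ** 2) / N_s ** 2 \
        + (c - cc) ** 2 / N_c ** 2 \
        + (d - cd) ** 2 / N_d ** 2

def dist_no_depth(px, center):
    """Squared distance D_n^2 used when the pixel or a center lacks depth."""
    (x, y, c, _), (cx, cy, cc, _) = px, center
    return ((x - cx) ** 2 + (y - cy) ** 2) / N_s ** 2 + (c - cc) ** 2 / N_c ** 2

def assign(px, centers, all_have_depth):
    """Assign a pixel to the nearest of its candidate cluster centers."""
    metric = dist_with_depth if all_have_depth else dist_no_depth
    return min(range(len(centers)), key=lambda i: metric(px, centers[i]))
```

Falling back to D_n whenever any depth is invalid keeps the segmentation well defined on images with holes in the depth map, which is the main departure from SLIC.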
Once all pixels have been assigned, the cluster centers are updated. [x, y], c, and r are updated as the average over all assigned pixels. The mean depth d, on the other hand, is updated by minimizing a Huber loss with radius δ:

E = Σ_u L_δ(d - d_u),

where u is an assigned pixel that has a valid depth value d_u. d can be estimated by Gauss-Newton iterations. This outlier-robust mean depth not only enables the system to process low-quality depth maps but also preserves depth discontinuities.
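The robust mean-depth update can be sketched with iteratively reweighted least squares, which for this scalar Huber problem plays the role of the Gauss-Newton update (a simplified stand-in, not the authors' code; delta and the sample depths are illustrative):

```python
def huber_mean_depth(depths, delta, iters=20):
    """Minimize E = sum_u L_delta(d - d_u) over the scalar d using
    iteratively reweighted least squares (Gauss-Newton-style updates)."""
    d = sum(depths) / len(depths)  # start from the plain mean
    for _ in range(iters):
        # Huber weights: 1 inside the quadratic region, delta/|r| outside.
        w = [1.0 if abs(d - du) <= delta else delta / abs(d - du)
             for du in depths]
        d = sum(wi * du for wi, du in zip(w, depths)) / sum(w)
    return d

# A single far-off depth barely shifts the robust mean.
robust = huber_mean_depth([2.0, 2.1, 1.9, 8.0], delta=0.1)
```

Compared with the plain mean (3.5 here), the robust estimate stays near the inlier depths; this down-weighting of outliers is what preserves depth discontinuities at superpixel borders.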
For a superpixel cluster center that has enough assigned pixels, we initialize one surfel in an outlier-robust way. The surfel intensity is initialized as the mean intensity c of the cluster. The attached keyframe index is initialized as the index of the reference keyframe given by the sparse SLAM system, and the update count is initialized to zero, meaning that the surfel has not yet been fused with other frames.
The position p and normal n are initialized using the information from all pixels of the superpixel. n is initialized as the average normal of these pixels and then fine-tuned by minimizing a fitting error defined as

e = Σ_u (n^T (p_u - p̄) + b)^2,

where p_u are the 3D points back-projected from the assigned pixels, p̄ is the mean of the 3D points p_u, and b estimates the bias. The surfel position p is defined as the point on the surfel that is observed by the camera as the pixel [x, y] of the cluster center:

p = λ K^{-1} [x, y, 1]^T,

and λ can be solved in closed form as

λ = (n^T p̄ - b) / (n^T K^{-1} [x, y, 1]^T),

where K is the camera intrinsic matrix.
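Assuming the plane-and-ray notation reconstructed above, the closed-form position solve is a few lines of NumPy (a sketch; the intrinsics and plane values are illustrative):

```python
import numpy as np

def surfel_position(n, b, p_bar, K, pixel):
    """Closed-form intersection of the fitted plane n^T (p - p_bar) + b = 0
    with the viewing ray p = lam * K^{-1} [x, y, 1]^T through `pixel`."""
    v = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    lam = (n @ p_bar - b) / (n @ v)
    return lam * v

# Illustrative intrinsics: a fronto-parallel plane at depth 2 seen at the
# principal point is recovered exactly.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = surfel_position(np.array([0.0, 0.0, 1.0]), 0.0,
                    np.array([0.0, 0.0, 2.0]), K, (320.0, 240.0))
```

Solving for λ instead of optimizing p directly keeps the surfel exactly on the fitted plane and on the viewing ray of the cluster center, at the cost of one dot product and one division.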
The surfel radius R is initialized so that its projection covers the extracted superpixel in the input intensity image:

R = d · r / f,

where d is the depth of the surfel, r is the radius of the superpixel in pixels, and f is the camera focal length.
Most depth estimation methods, like stereo matching or active stereo (e.g. Ultrastereo), work by first estimating the pixel disparity disp and then inverting it into a depth value d = b f / disp, where b is the baseline of the sensors and f is the focal length. Assuming the variance of the disparity estimation is σ_disp^2, the weight w is initialized as the inverse variance of the estimated surfel depth:

w = (b f)^2 / (d^4 σ_disp^2).
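Under first-order error propagation for d = b f / disp, the inverse-variance weight can be sketched as follows (a hypothetical helper with made-up argument values):

```python
def surfel_weight(depth, baseline, focal, sigma_disp):
    """Inverse variance of depth d = baseline * focal / disp, using the
    first-order propagation sigma_d = (d^2 / (baseline * focal)) * sigma_disp,
    so that w = (baseline * focal)^2 / (d^4 * sigma_disp^2)."""
    sigma_d = depth ** 2 / (baseline * focal) * sigma_disp
    return 1.0 / sigma_d ** 2

# Doubling the depth reduces the weight by a factor of 16 (quartic falloff).
w_near = surfel_weight(2.0, 0.5, 500.0, 1.0)
w_far = surfel_weight(4.0, 0.5, 500.0, 1.0)
```

The quartic falloff means nearby surfels dominate the fusion, which matches the intuition that disparity-based depth is far noisier at range.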
Reconstructing large-scale environments may generate millions of surfels. However, only a subset of surfels is extracted based on the pose graph to fuse with the initialized surfels, for the following reasons. Firstly, the local map fusion ensures O(1) update time regardless of the reconstruction scale. Secondly, due to the accumulated tracking error of the sparse SLAM system, fusing surfels that have drifted significantly corrupts the map so that it cannot achieve global consistency even if loops are detected afterward.
Here, we introduce a novel approach that uses the pose graph from the localization system to identify local maps. With the assumption in Section IV-B that keyframes whose minimum number of edges to the current keyframe is below a given threshold are locally consistent, we extract the surfels attached to these keyframes as the local map. Locally consistent keyframes can be found by a breadth-first search on the pose graph. When loops are detected and edges between these keyframes are added, previous surfels can be reused so that map growth is reduced. As shown in (d) of Fig. 3, previous maps are reused due to the loop closure.
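The breadth-first search over the pose graph can be sketched as follows, with the pose graph as an adjacency dictionary (an illustrative data layout, not the authors'):

```python
from collections import deque

def local_keyframes(pose_graph, current, max_edges):
    """Breadth-first search on the pose graph: keyframes reachable from the
    current keyframe within max_edges edges are treated as locally
    consistent. pose_graph maps keyframe id -> list of neighbor ids."""
    dist = {current: 0}
    queue = deque([current])
    while queue:
        kf = queue.popleft()
        if dist[kf] == max_edges:
            continue  # do not expand beyond the local-consistency radius
        for neighbor in pose_graph.get(kf, []):
            if neighbor not in dist:
                dist[neighbor] = dist[kf] + 1
                queue.append(neighbor)
    return set(dist)
```

When a loop closure adds an edge between the current keyframe and an old one, the old keyframe (and its surfels) re-enters this set, which is exactly how previous maps get reused instead of duplicated.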
In this section, the local surfels extracted in Section IV-F are fused with the surfels newly initialized in Section IV-E. Given the current camera pose estimate, the positions and normals of the local surfels are first transformed into the current camera frame. Each local surfel is then back-projected into the input frame as a pixel u = π(p). If a surfel S_n has been initialized from the superpixel containing u, we declare a correspondence if the two surfels have similar depths and normals, i.e. their depth difference and normal angle are below fixed thresholds. S_n is then fused into the corresponding local surfel by a weight-based average: the position, normal and intensity become weighted means of the two surfels, the weights are summed, and the update count is incremented.
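A weighted-average fusion step consistent with the description above can be sketched as follows (the field names 'p', 'n', 'i', 'w', 't' and the exact rule are a reconstruction of a standard surfel update, not the authors' verbatim code):

```python
import numpy as np

def fuse(surfel, new):
    """Fuse a newly initialized surfel into its matched local surfel with a
    weight-based average; higher-weight (lower-variance) measurements
    dominate. Each surfel is a dict with position p, normal n, intensity i,
    weight w, and update count t."""
    w = surfel["w"] + new["w"]
    for key in ("p", "n", "i"):
        surfel[key] = (surfel["w"] * surfel[key] + new["w"] * new[key]) / w
    surfel["n"] = surfel["n"] / np.linalg.norm(surfel["n"])  # keep unit length
    surfel["w"] = w
    surfel["t"] += 1  # one more successful observation
    return surfel
```

Because the weight w is an inverse depth variance, this average is the usual precision-weighted estimate, and the growing update count t is what later distinguishes stable surfels from outliers.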
After the fusion, all local surfels are transformed back into the global frame and moved into the global map. Surfels that were initialized in this frame but have not been fused with the local map are also transformed and added to the global map. To handle outliers, surfels that leave the local map but have been updated fewer than a threshold number of times are removed.
The surfel mapping system is implemented using only CPU computing and achieves real-time performance even when it reconstructs urban-scale environments. Superpixels are initialized on a regular grid with small spacing; small superpixels give the system a balance between efficiency and reconstruction accuracy. The normalization factors used during the pixel assignment in Equation 1 and Equation 2 are fixed, and only superpixels with enough assigned pixels are used to initialize surfels. The radius used in the Huber loss and the disparity error are determined by the depth sensors or depth estimation methods.
In this section, we first compare the proposed mapping system with other state-of-the-art methods using the ICL-NUIM dataset. The performance of the proposed system in large-scale environments is also analyzed using the KITTI dataset. The platform used to evaluate our method is a workstation with an Intel i7-7700 CPU. Finally, we use the reconstructed map to support UAV autonomous aggressive flights to demonstrate the usability of the system. In the experiments, we show that the proposed method can fuse depth maps from RGB-D data, stereo matching, and monocular depth prediction.
We evaluate the accuracy of the reconstructed models using ICL-NUIM and compare it with that of other mapping methods. The dataset provides rendered RGB images and the corresponding depth maps from a synthetic room. To simulate real-world data, the dataset adds noise to both the RGB images and the depth images, and the noise parameters for surfel initialization are set accordingly. We use ORB-SLAM2 in RGB-D mode to track the camera motion, and the local map for fusion is extracted from the pose graph as described in Section IV-F.
The reconstruction accuracy is defined as the mean difference between the reconstructed model and the ground truth model. Here, we compare the proposed mapping method with BundleFusion , ElasticFusion , InfiniTAM  and the recently published FlashFusion . To evaluate the ability to maintain the global consistency, we also evaluate Ours w/o loop in which the loop closure in ORB-SLAM2 is disabled.
The result is shown in Table I and Fig. 4. Please note that only FlashFusion and our proposed system do not need GPU acceleration. BundleFusion, on the other hand, uses two high-end desktop GPUs for frame reintegration and stores all the fused RGB-D frames. Although our method is designed for large-scale efficient reconstruction, it achieves results similar to those of FlashFusion. Only kt3 contains global loops, and on it our method reduces the reconstruction error by removing the drift accumulated during motion tracking.
Table I (excerpt, surface reconstruction error in cm): Ours w/o loop — kt0: 0.7, kt1: 0.9, kt2: 1.1, kt3: 1.7
Most of the previous online dense reconstruction methods focus on room-scale environments using RGB-D cameras. Here, thanks to the memory and computation efficiency, we show that our method can reconstruct much larger environments, such as the streets in the KITTI dataset. Both the fusion time and the memory usage are studied as the reconstruction scale grows. We use PSMNet to generate depth maps from stereo images and ORB-SLAM2 in stereo mode to track the moving camera. The disparity-error and superpixel parameters are set according to the environment and the stereo method. We use KITTI odometry sequence 00 for the evaluation.
The first row of Fig. 1 shows the reconstruction result and the detail of one looped corner. Fig. 5 shows the map before and after the map deformation. The time efficiency of our method during the reconstruction of KITTI sequence 00 is shown in Fig. 6. As shown in the figure, the average per-frame fusion time is low enough for real-time operation using only CPU computation. Unlike other dense mapping methods, such as TSDF-based methods, our method spends most of the time extracting superpixels and initializing surfels. The outlier-robust superpixel extraction and surfel initialization enable our system to use low-quality stereo depth maps. The surfel fusion itself, on the other hand, consumes only a small fraction of the per-frame time regardless of the environment scale. Since ORB-SLAM2 optimizes the whole pose graph frequently, our system deforms the map accordingly to maintain global consistency. The memory usage of the system during the runtime is shown in Fig. 7. Between frame 3000 and 4000, the vehicle revisits one street and ORB-SLAM2 detects loop closures between the keyframes. Based on the updated pose graph, our system reuses previous surfels so that the memory grows with the environment scale instead of the runtime.
One of the advantages of the proposed method is that it can fuse depth maps from different kinds of sensors. In the previous sections, we showed dense mapping using rendered RGB-D images and stereo cameras. In this section, the proposed dense mapping system is used to reconstruct a KITTI sequence using only one monocular camera. Only the left images from the dataset are used to predict the depth maps, and the camera poses are tracked using ORB-SLAM2 in RGB-D mode (with the left image and the predicted depth map). The reconstruction result is shown in the bottom row of Fig. 1. During the fusion, the disparity error is set according to the variance of the monocular depth estimation. To the best of our knowledge, our method is the first to reconstruct KITTI sequences with metric scale using predicted depth maps.
To prove the usability of the proposed dense mapping system, we apply it to support autonomous aggressive flights. A dense model of the environment is first built with a handheld monocular camera. Then, a flight path is generated so that the quadrotor can navigate safely and aggressively in the environment. MVDepthNet is used to estimate monocular depth maps, and VINS-MONO is used to track the camera motion. During the scene reconstruction, the proposed mapping approach corrects map drift according to the detected loops so that obstacles are consistent between different visits. We also compare the reconstruction results with CHISEL (https://github.com/personalrobotics/OpenChisel) using the same input images and camera poses. Since CHISEL cannot deform the map to eliminate the detected drift, fine obstacles cannot be reconstructed correctly when they are revisited. The results are shown in Fig. 8. Aggressive flights using the reconstructed maps can be found in the supplementary video. Please note that all indoor obstacles and outdoor trees are reconstructed accurately using our method. On the other hand, CHISEL cannot reconstruct fine obstacles due to the drift between different visits, and its maps are not usable for autonomous flights.
In this paper, we propose a novel surfel mapping method that can fuse sequential depth maps into a globally consistent model in real-time without GPU acceleration. The system is carefully designed so that it can handle low-quality depth maps and maintain run-time efficiency. Surfels used in our system are initialized from extracted outlier-robust superpixels. Surfels are further organized according to the pose graph of the localization system so that the system maintains O(1) fusion time and can deform the map to achieve global consistency in real-time. These characteristics make the proposed mapping system suitable for robot applications.
This work was supported by the Hong Kong PhD Fellowship Scheme.
Proc. of the IEEE Int. Conf. on Pattern Recognition, pp. 5410–5418.
Conference on Computer Vision and Pattern Recognition (CVPR).
MVDepthNet: real-time multiview depth estimation neural network. In Proc. of the Int. Conf. on 3D Vision.