Robots acting in real-world environments need the ability to understand their surroundings, and know their location within the environment. While the problem of geometrical mapping and localization can be solved through SLAM methods (zollhoefer2018survey), many tasks require knowledge about the semantic meaning of objects or surfaces in the environment. The robot should, for instance, be able to recognize where the obstacles are in the scene, and also understand whether those obstacles are cars, pedestrians, walls, or otherwise. The problem of building maps has been extensively studied (kostavelis2015survey). Most approaches can be grouped into the following three categories, based on map representation:
Voxel-based: The scene is discretized into voxels, either using a regular grid or an adaptive octree. Each voxel stores an occupancy state (occupied, empty, unknown) or the distance to the surface, commonly represented as a Signed Distance Function (SDF).
Surfel-based: The map is represented by small surface elements, which store the mean and covariance of a set of 3D points. Surfels suffer from fewer discretization errors than voxels.
Mesh-based: The map is represented as a set of vertices with faces between them. This naturally fills holes and allows for fast rendering using established graphics pipelines.
Current semantic mapping systems treat the semantic information as part of the geometry, and store label probabilities per map element (voxel, surfel or mesh vertex/face). This approach has the intrinsic disadvantage of coupling the resolution of the geometrical representation to the semantics, requiring a large number of elements to represent small semantic objects or surface parts. This is undesirable as it leads to unnecessary memory usage, especially in man-made environments, where the geometry is mostly planar and high geometrical detail would be redundant. Often, it suffices to represent the semantics relative to a rough geometric shape.
The key idea of our approach, visualized in Fig. 1, is to couple the scene geometry with the semantics at independent resolution by using a semantic texture mesh. In this representation, the scene geometry is represented by vertices and faces, whereas the semantic texture categorizes the surface with higher resolution. This allows us to represent semantics and geometry at different resolutions in order to build a large semantic map, while still maintaining a low memory usage.
As our segmentation module, we make use of RefineNet (lin2017refinenet) to predict a semantic segmentation for each individual RGB view of the scene. These predictions are probabilistically fused onto the semantic texture that is supported by a coarse mesh representing the scene geometry. Having a globally persistent semantic map enables us to establish a temporal and spatial consistency that was previously unobtainable for individual-view predictors. To this end, we propose to propagate labels from the stable mesh by projection onto each camera frame, in order to retrain the semantic segmentation in a semi-supervised manner. Expectation Maximization (EM) is then carried out by alternating between fusing semantic predictions and propagating labels. This iterative refinement allows us to cope with view points which were not common in the training dataset. For example, a predictor pretrained on street-level segmentation will not work well on images captured by a micro aerial vehicle (MAV) at higher altitudes or close to buildings. However, projecting confident semantic labels fused from street level onto less confident parts of views enables the predictor to learn the semantic segmentation of new viewpoints (see Fig. 6).
We compare our method with SemanticFusion (mccormac2017semanticfusion), and evaluate the accuracy on the NYUv2 dataset (silberman2012nyu). We show that the increased resolution of the semantic texture allows for more accurate semantic maps. Finally, propagation and retraining further improve the accuracy, surpassing SemanticFusion in every class. To showcase the benefits of textured meshes in terms of scalability and speed, we also recorded a dataset spanning multiple buildings, annotated with the classes of the Mapillary dataset (neuhold2017mapillary). We demonstrate that we are able to construct a large map using both RGB and semantic information in a time- and memory-efficient manner.
2 Related Work
The annotation of large datasets is a costly and time-consuming matter. Hence, the automation of annotation as well as the transfer of knowledge across different domains and datasets are active research topics. Most networks for image segmentation or their respective backbone (e.g. RefineNet (lin2017refinenet) and Mask R-CNN (he2017mask) with ResNet backbone (he2016resnet)) are nowadays pretrained on large datasets like ImageNet (deng2009imagenet) and only finetuned for a specific dataset or purpose. vezhnevets2012active classify super pixels in an automated manner using a pairwise conditional random field and request human intervention based on the expected change. Likewise, jain2016prop use a Markov Random Field for joint segmentation across images given region proposals with similar saliency. The resulting proposals are later fused to obtain foreground masks, while supervision is requested based on an image's influence, diversity and the predicted annotation difficulty. Instead, yang2017suggestive cluster unannotated data based on cosine similarity to other images and simply choose per cluster the one with the most similar images for human labeling. mackowiak2018cereals take a more cost-centric approach and train one CNN for semantic segmentation and one for a cost model that estimates the necessary clicks for annotating a region. The cost model predictions are then fused with the vote entropy of the segmenting network's activation, and supervision is requested for a fixed number of regions. With Polygon-RNN, castrejon2017prnn provide a more interactive approach. Given a (drawn) bounding box around an object, the RNN with VGG-16 backbone (VGG16) predicts an enclosing polygon around the object. The polygon can be corrected by a human annotator and fed back into the RNN to improve the overall annotation accuracy. acuna2018prnnpp improve upon Polygon-RNN through architecture modifications, training with reinforcement learning and increased polygonal output resolution.
Most semantic segmentation methods are not real-time capable. Hence, sheikh2016prop proposed to use quad-tree based super pixels where only the center is classified by a random forest and labels are propagated to a new image if the super pixel's location and intensity do not change significantly. This inherently assumes small inter-frame motion and does not take spatial correspondences into account, yet runs on a CPU with up to 30 fps.
While image segmentation is fairly advanced, labeling point clouds still has a large potential for improvement. Voxel-based approaches like OctNet (riegler2017octnet) precompute a voxel grid and apply 3D convolutions. Most grid cells are empty for sparse LIDAR point clouds. Hence, recent research shifts towards using points directly (qi2017pointnet; qi2017pointnetpp), forming clusters of points (landrieu2017spg), applying convolutions on local surfaces (tatarchenko2018tangentconv) or lifting points to a high-dimensional sparse lattice (su2018splatnet). These methods do not enforce consistent labels for sequential data and would need to be recomputed once new data is aggregated while being strongly memory constrained. Nevertheless, zaganidis2018segicp showed that semantic predictions can improve point cloud registration with GICP and NDT.
Semantic reconstruction and mapping received much attention in recent years. civera2011towards, for example, paved the way towards a semantic SLAM system by presenting an object reasoning system, able to learn object models using feature descriptors in an offline step and then to recognize and register them to the map at run time. However, their system was limited to a small number of objects and apart from the recognized objects, the map was represented only as a sparse point cloud. bao2011semantic exploit semantics for Structure-from-Motion (SfM) to reduce the initial number of possible camera configurations and add a semantic term during Maximum-Likelihood estimation of camera poses and scene structure.
Subsequently, bao2013dense use the estimated scene structure to generate dense reconstructions from learned class-specific mean shapes with anchor points. The mean shape is warped with a 3D thin plate spline and local displacements are obtained from actual details of the instance. Instead, hane2013joint fuse single-frame depth maps from plane sweep stereo to reconstruct a uniform voxel grid and jointly label these voxels by rephrasing the fusion as a multi-label assignment problem. A primal-dual algorithm solves the assignment while penalizing the transition between two classes based on class-specific geometry priors for surface orientation. Their method is also able to reconstruct and label underlying voxels and not only visible ones. In subsequent work, more elaborate geometry priors have been learned, e.g. using Wulff shapes from surface normal distributions (hane2014wolff) and recently end-to-end-learned with a 3D-CNN (cherabier2018learn). The data term in the optimization has been improved (savinov16visconst), memory consumption and runtime reduced (cherabier2016multi; blaha2016large), and an alignment to shape priors integrated (maninchedda2016head). schonberger2018semloc utilize the approach of hane2013joint for visual localization with semantic assistance to learn descriptors. An encoder-decoder CNN is trained on the auxiliary task of scene completion given incomplete subvolumes. The encoder is then used for descriptor estimation. Given a bag of words with a corresponding vocabulary, one can thus query matching images for a given input frame.
For incremental reconstruction, stuckler2014semantic presented a densely represented approach using a voxel grid map. Semantic labels were generated for individual RGB-D views of the modeled scene by a random forest. Labels were then assigned projectively to each occupied voxel and fused using a Bayesian update. The update effectively improved the accuracy of backprojected labels compared to instantaneous segmentation of individual RGB-D views. Similarly, hermans2014dense fused semantic information obtained from segmenting RGB-D frames using random forests but represented the map as a point cloud.
Their main contribution was an efficient spatially regularizing Conditional Random Field (CRF), which smooths semantic labels throughout the point cloud. li2016semi extended this approach to monocular video while using the semi-dense map of LSD-SLAM (engel2014lsd). Here, the DeepLab-CNN (chen2018deeplab) was used instead of a random forest for segmentation. vineet2015incremental achieve a virtually unbounded scene reconstruction through the use of an efficient voxel-hashed data structure for the map. This further allows them to incrementally reconstruct the scene. Instead of RGB-D cameras, stereo cameras were employed and depth was estimated from stereo disparity. Semantic segmentation was performed by a random forest. The requirement for dense depth estimates is lifted in the approach of kundu2014joint. They use only sparse triangulated points obtained through monocular visual SLAM and recover a dense volumetric map through a CRF that jointly infers semantic category and occupancy for each voxel. A different approach is used in the keyframe-based monocular SLAM system by tateno2017cnn, where a CNN predicts the pixel-wise monocular depth and semantic labels per keyframe. lianos2018vso reduce drift in visual odometry by establishing semantic correspondences over longer periods than possible with purely visual correspondences. The intuition is that the semantic class of a car will stay a car even under diverse illumination and view point changes, while visual correspondences may be lost. However, semantic correspondences are not discriminative in the short term. tulsiani2017multi perform single-view reconstruction of a dense voxel grid with a CNN. During training, multiple views of the same scene guide the learning by enforcing the consistency of viewing rays, incorporating information from multiple sources like foreground masks, depth, color or semantics.
In contrast, ma2017multiview examined warping RGB-D image sequences into a reference frame for semantic segmentation to obtain more consistent predictions. sun2018roctomap extend OctoMap (hornung2013octomap) with an LSTM per cell to be able to account for long-term changes like dynamic obstacles. nakajima2018geosem segment surfels from the depth image and the semantic prediction using connected component analysis, and refine the segments incrementally over time. Geometric segments along with their semantic label are stored in the 3D map. The probabilistic fusion combines the rendered current view with the current frame and its low-resolution semantics.
Surfels are also used in the work of mccormac2017semanticfusion. The authors integrated semantics into ElasticFusion (whelan2015elasticfusion), which represents the environment as a dense surfel map. ElasticFusion is able to reconstruct the environment in real-time on a GPU given RGB-D images and can handle local as well as global loop closure. Semantic information is stored on a per-surfel basis. Inference is done by an RGB-D-CNN before fusing estimates probabilistically. SemanticFusion fuses the estimates for each visible surfel and all possible classes, which is very time- and memory-consuming since the class probabilities need to be stored per surfel and class on the GPU. Objects normally consist of a large number of surfels that in reality share a single class label, whereas per-surfel semantic information would only be required at the border of the object, where the class is likely to change. Hence, many surfels store the same redundant information and, since GPU memory is notoriously limited, memory usage becomes a problem for larger surfel maps. Furthermore, SemanticFusion tends to create many unnecessary surfels with differing scales and labels for the same surface when sensed from different distances.
Closely related to our approach is the work of valentin2013mesh. Their map is represented as a triangular mesh.
They aggregate depth images in a Truncated Signed Distance Function (TSDF) and obtain the explicit mesh representation via the marching cubes algorithm. Afterwards, semantic inference is performed for each triangle independently using a learned classifier on an aggregation of photometric (color-dependent) and handcrafted local geometric (mesh-related) features. Spatial regularization is ensured through a CRF over the mesh faces. Their classifier infers the label from all visible pixels per face at once and is not designed to incrementally fuse new information. Furthermore, the pairwise potential of the CRF does not take the likelihood of other classes into account. Especially around object borders, this may lead to suboptimal results. The semantic resolution is tied to the geometry of the mesh; hence, to capture fine details, the mesh resolution needs to be fine-grained. Geometrically, a wall can be described with a small number of vertices and faces, but to semantically distinguish between an attached poster and the wall itself, the mesh would need a high resolution. In comparison, we only store the likelihood for a small number of most probable classes, and the meshing creates a single simplified surface, while the texture resolution can be chosen independently of the geometry yet appropriate to the scene. During fusion, we further include weighting to account for the sensor distance and, in the case of color integration, for vignetting and viewing angle.
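To illustrate the idea of storing only a small number of most probable classes, the following minimal Python sketch truncates a dense class distribution to its k largest entries. The function name `keep_top_k`, the dictionary layout and the renormalization of the retained probability mass are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
def keep_top_k(probabilities, k):
    """Keep the k most probable classes and renormalize their probabilities.

    Returns a dict mapping class index -> probability (an assumed layout).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank class indices by descending probability.
    ranked = sorted(range(len(probabilities)),
                    key=lambda c: probabilities[c], reverse=True)
    kept = ranked[:k]
    mass = sum(probabilities[c] for c in kept)
    if mass <= 0.0:
        raise ValueError("retained classes carry no probability mass")
    # Renormalize so that the sparse distribution sums to one again.
    return {c: probabilities[c] / mass for c in kept}


dense = [0.10, 0.60, 0.05, 0.25]   # hypothetical 4-class distribution
sparse = keep_top_k(dense, k=2)    # keeps classes 1 and 3
```

With many classes and a small k, the storage per element no longer grows with the number of classes, at the cost of discarding the probability mass of unlikely classes.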
In this paper, we present a novel approach to building semantic maps by decoupling the geometry of the environment from its semantics through semantic textured meshes. This decoupling allows us to store the geometry of the scene as a lightweight mesh which efficiently represents even city-sized environments. Our method (see Fig. 2) operates in three steps: mesh generation, semantic texturing and label propagation.
In the mesh generation step, we create a mesh of the environment by aggregating the individual point clouds recorded by a laser scanner or an RGB-D camera. We assume that the scans are preregistered into a common reference frame using any off-the-shelf SLAM system. We calculate the normals for the points in each scan by estimating an edge-maintaining local mesh for the scan. Once the full point cloud equipped with normals is aggregated, we extract a mesh using Poisson reconstruction (kazhdan2013screened) and further simplify it using QSlim (garland1998qslim). Our main contribution for 3D reconstruction is a system capable of fast normal estimation using a local mesh, combined with local line simplification, which heavily reduces the number of points and therefore the time and memory used by Poisson reconstruction.
In the semantic texturing step, we first prepare the mesh for texturing by parameterizing it into a 2D plane. Seams and cuts are added to the mesh in order to deform it into a planar domain. A semantic texture is created in which the number of channels corresponds to the number of semantic classes. The semantic segmentation of each individual RGB frame is inferred by RefineNet and fused probabilistically into the semantic texture. We ensure bounded memory usage on the GPU by dynamically allocating and deallocating parts of the semantic texture as needed. Additionally, the RGB information is fused in an RGB texture.
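As an illustration of what probabilistic fusion into a multi-channel semantic texture can look like, the following sketch applies a recursive Bayesian update to a single texel. It assumes that the per-view predictions are conditionally independent and starts from a uniform distribution; the names `fuse_texel` and `eps` are hypothetical, and any weighting by sensor distance or viewing angle is deliberately left out.

```python
def fuse_texel(prior, prediction, eps=1e-6):
    """Multiply the stored class distribution by a new prediction and renormalize."""
    if len(prior) != len(prediction):
        raise ValueError("prior and prediction need one entry per class")
    # Clamp to eps so that a single zero cannot rule out a class forever.
    unnormalized = [max(p, eps) * max(q, eps) for p, q in zip(prior, prediction)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


num_classes = 3
texel = [1.0 / num_classes] * num_classes          # uninformative start
views = ([0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1])  # made-up predictions
for view_prediction in views:
    texel = fuse_texel(texel, view_prediction)
```

Each consistent observation sharpens the stored distribution, which is why fused labels tend to be more reliable than any single-view prediction.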
In the label propagation step, we project the stable semantics, stored in the textured mesh, back into the camera frames and retrain the predictor in a semi-supervised manner using high-confidence fused labels as ground truth, allowing the segmentation to learn from novel view points. Hence, the contribution presented in this article is fourfold:
a scalable system for building accurate meshes from range measurements with coupled geometry and semantics at independent resolution,
an edge-maintaining local mesh generation from lidar scans,
a label propagation that ensures temporal and spatial consistency of the semantic predictions, which helps the semantic segmentation to learn and perform segmentation from novel view points,
fast integration of probability maps by leveraging the GPU with bounded memory usage.
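The selection of high-confidence fused labels as ground truth for retraining, as used in the label propagation step, can be sketched as follows. The sketch assumes that the fused class probabilities have already been rendered into a camera frame as a nested list with one distribution per pixel; the threshold of 0.9 and the `IGNORE_LABEL` value are hypothetical choices for illustration.

```python
IGNORE_LABEL = 255  # hypothetical marker for pixels excluded from the training loss


def pseudo_labels(rendered_probs, threshold=0.9):
    """Turn per-pixel class distributions into labels, keeping only confident pixels."""
    labels = []
    for row in rendered_probs:
        label_row = []
        for probs in row:
            # Most probable class for this pixel.
            best = max(range(len(probs)), key=lambda c: probs[c])
            # Pixels below the confidence threshold do not supervise retraining.
            label_row.append(best if probs[best] >= threshold else IGNORE_LABEL)
        labels.append(label_row)
    return labels


# A made-up 2x2 rendering with two classes.
rendered = [[[0.95, 0.05], [0.60, 0.40]],
            [[0.02, 0.98], [0.50, 0.50]]]
targets = pseudo_labels(rendered)
```

Pixels marked with `IGNORE_LABEL` would simply be skipped by the segmentation loss, so that only stable map content influences the retrained predictor.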
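In the following, we will denote matrices by bold uppercase letters and (column-)vectors with bold lowercase letters. The rigid transformation $\mathbf{T}_{F_2 F_1}$ is represented as a $4 \times 4$ matrix and maps points from coordinate frame $F_1$ to coordinate frame $F_2$ by operating on homogeneous coordinates. When necessary, the frame in which a point is expressed is added as a subscript: e.g. $\mathbf{p}_w$ for points in world coordinates. A point $\mathbf{p}_w$ is projected into frame $F_c$ with the pose $\mathbf{T}_{F_c F_w}$ and the camera matrix $\mathbf{K}_{F_c}$. For the camera matrix, we assume a standard pinhole model with focal length $f_x$, $f_y$ and principal point $c_x$, $c_y$. The projection of $\mathbf{p}_c = (x, y, z)^\top$ into image coordinates $\mathbf{u} = (u_x, u_y)^\top \in \Omega \subset \mathbb{R}^2$ is given by the following mapping:
$$ g(\mathbf{p}_c) = \left( f_x \frac{x}{z} + c_x,\; f_y \frac{y}{z} + c_y \right)^\top $$
An image or a texture is denoted by $I(\mathbf{u}): \Omega \to \mathbb{R}^n$, where $I$ maps from pixel coordinates to $n$-channel values.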
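The projection of a world point into image coordinates under a standard pinhole model can be sketched in a few lines of Python. The function names `transform` and `project`, the row-major nested-list layout of the 4x4 pose and the example intrinsics are assumptions chosen for illustration.

```python
def transform(T, p):
    """Apply a 4x4 rigid transform (row-major nested lists) to a 3D point."""
    x, y, z = p
    h = (x, y, z, 1.0)  # homogeneous coordinates
    return tuple(sum(T[r][c] * h[c] for c in range(4)) for r in range(3))


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a point in camera coordinates to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0.0:
        return None  # behind the camera, no valid projection
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical pose: the world origin lies 2 m in front of the camera.
T_cam_world = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 2.0],
               [0.0, 0.0, 0.0, 1.0]]
p_world = (0.5, -0.25, 0.0)
p_cam = transform(T_cam_world, p_world)
u = project(p_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

The same two steps are used whenever map content is rendered into a camera frame or image content is fused back onto the surface.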
The input to our system is a sequence of organized point clouds and RGB images, indexed by time step. An organized point cloud exhibits an image-like structure, e.g. from commodity RGB-D sensors. We assume that the point clouds are already registered into a common reference frame, and that the extrinsic calibration from depth sensor to camera, as well as the camera matrices, are given. The depth sensor can be an RGB-D camera or a laser scanner. The output of our system is threefold:
a triangular mesh of the scene geometry, defined as a tuple of vertices and faces. Each vertex contains a 3D point and a UV texture coordinate, while each mesh face is represented by the indices of its three spanning vertices.
a semantic texture storing the class probabilities of each texel,
an RGB texture representing the surface appearance.
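The mesh output listed above can be pictured as a simple data structure. The following sketch is one possible layout, assuming UV coordinates in [0, 1]; the class names `Vertex` and `TexturedMesh`, the method names and the texture size are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Vertex:
    position: tuple  # 3D point (x, y, z)
    uv: tuple        # texture coordinate (u, v), assumed to lie in [0, 1]


@dataclass
class TexturedMesh:
    vertices: list       # list of Vertex
    faces: list          # list of (i, j, k) vertex index triples
    texture_width: int
    texture_height: int

    def validate(self):
        """Check that every face references three existing vertices."""
        for face in self.faces:
            if len(face) != 3:
                raise ValueError("faces must be triangles")
            if any(i < 0 or i >= len(self.vertices) for i in face):
                raise ValueError("face references a missing vertex")

    def texel_of(self, uv):
        """Map a UV coordinate to an integer (column, row) texel index."""
        u, v = uv
        col = min(int(u * self.texture_width), self.texture_width - 1)
        row = min(int(v * self.texture_height), self.texture_height - 1)
        return col, row


# A planar wall needs only four vertices and two triangles, while its
# texture resolution is chosen separately.
wall = TexturedMesh(
    vertices=[Vertex((0, 0, 0), (0.0, 0.0)), Vertex((4, 0, 0), (1.0, 0.0)),
              Vertex((4, 3, 0), (1.0, 1.0)), Vertex((0, 3, 0), (0.0, 1.0))],
    faces=[(0, 1, 2), (0, 2, 3)],
    texture_width=512, texture_height=512)
wall.validate()
```

The example shows the decoupling in miniature: the geometry stays at four vertices regardless of how finely the semantic or RGB texture resolves the wall surface.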
After describing the necessary depth preprocessing in Sec. 5.1, we explain the mesh generation and parametrization in detail, before elaborating on the semantic and color integration, the sparse representation and the label propagation.
5.1 Depth Preprocessing
As previously mentioned, our system constructs a global mesh from the aggregation of a series of point clouds recorded from a depth sensor. Many surface reconstruction algorithms require accurate per-point normals. One way to obtain these normals is by aggregating the full global point cloud, and using the k-nearest neighbors to estimate the normal for each point. However, this would be prohibitively slow as it requires a spatial subdivision structure, like a Kd-tree, which can easily grow to a considerable size for large point clouds, limiting the scalability of the system. For fast normal estimation, we take advantage of the structure of the recorded point cloud. Since depth from an RGB-D sensor is typically structured as an image, we can easily query adjacent neighboring points. Similarly, rotating lidar sensors can produce organized scans. A complete revolution of e.g. a Velodyne VLP-16 produces a 2D array containing the measured range of each recorded point, with one row per laser and a number of columns determined by the speed of revolution of the laser scanner.
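To show why an organized point cloud makes neighbor queries cheap, the following sketch estimates normals from the grid neighbors of each point using a cross product of finite differences. This is a deliberately simplified stand-in and not the edge-maintaining local mesh used in our system; the function name `organized_normals`, the use of `None` for invalid measurements and the orientation of normals towards a sensor at the origin are assumptions.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def organized_normals(points):
    """Estimate unit normals for a rows x cols grid of 3D points (None = invalid)."""
    rows, cols = len(points), len(points[0])
    normals = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            p = points[r][c]
            # Clamping the indices yields one-sided differences at the border.
            left, right = points[r][max(c - 1, 0)], points[r][min(c + 1, cols - 1)]
            up, down = points[max(r - 1, 0)][c], points[min(r + 1, rows - 1)][c]
            if None in (p, left, right, up, down):
                continue
            n = cross(sub(right, left), sub(down, up))
            length = math.sqrt(sum(v * v for v in n))
            if length == 0.0:
                continue  # degenerate neighborhood
            n = tuple(v / length for v in n)
            # Flip the normal so that it points towards the sensor at the origin.
            if sum(a * b for a, b in zip(n, p)) > 0.0:
                n = tuple(-v for v in n)
            normals[r][c] = n
    return normals


# A made-up 3x3 scan of the plane z = 1 in front of the sensor.
grid = [[(float(c), float(r), 1.0) for c in range(3)] for r in range(3)]
normals = organized_normals(grid)
```

Every neighbor lookup is a constant-time array access, so no Kd-tree or other spatial subdivision structure is required.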