High-quality Instance-aware Semantic 3D Map Using RGB-D Camera

03/26/2019 ∙ by Dinh-Cuong Hoang, et al.

We present a mapping system capable of constructing detailed instance-level semantic models of room-sized indoor environments by means of an RGB-D camera. In this work, we integrate deep-learning-based instance segmentation and classification into a state-of-the-art RGB-D SLAM system. We leverage the pipeline of ElasticFusion [1] as a backbone, and propose modifications of the registration cost function to make full use of the instance class labels in the process. The proposed objective function features tunable weights for the depth, appearance, and semantic information channels, which can be learned from data. The resulting system is capable of producing accurate semantic maps of room-sized environments, as well as reconstructing highly detailed object-level models. The developed method has been verified through experimental validation on the TUM RGB-D SLAM benchmark and the YCB video dataset. Our results confirm that the proposed system performs favorably in terms of trajectory estimation, surface reconstruction, and segmentation quality in comparison to other state-of-the-art systems.







I Introduction

With the advent of 3D measurement devices using structured light sensing, such as affordable RGB-D cameras like the ASUS Xtion Pro Live or Microsoft's Kinect, research on SLAM (Simultaneous Localization and Mapping) has made giant strides [1, 2, 3]. These approaches achieve dense surface reconstruction of complex indoor scenes while maintaining real-time performance through implementations on highly parallelized hardware. Beyond classical SLAM systems, which provide a purely geometric map, the idea of a system that generates a dense map in which object instances are semantically annotated has attracted substantial interest in the research community [4, 5, 6, 7, 8]. An instance-aware semantic 3D map is useful for enabling more context-aware and intelligent robot behaviors.

Fig. 1: An instance-aware semantic 3D map of our office produced by the proposed mapping system.

In this study, we propose a 3D mapping system to produce highly accurate object-aware semantic scene reconstructions. Our work benefits from incorporating state-of-the-art RGB-D SLAM and deep-learning-based instance segmentation techniques [1, 9]. Unlike previous related works [6, 7, 8], which use semantic information solely for data fusion, we employ the rich segmentation information from CNNs to increase the robustness of camera tracking through a joint cost function in which all given information is used: depth, the RGB image, and segmentation information. We also develop a CNN architecture beyond the original Mask R-CNN that takes an RGB image as input and outputs adaptive weights for the cost function used in the sensor pose estimation process. In contrast to existing approaches that update class probabilities for all elements (surfels or voxels) in the 3D map, we reduce the space complexity with a more efficient strategy based on instance labels. In addition to the highly accurate semantic scene reconstruction, we correct misclassified regions using two proposed criteria which rely on location information and the pixel-wise class probability. We evaluate the performance of our system on the TUM RGB-D SLAM benchmark and the YCB video dataset [10] and show that our system benefits greatly from the use of the proposed joint cost function with adaptive weights. The developed system performs on par with the state of the art in terms of camera trajectory estimation while generating accurate object instance models. We also show that our approach leads to an improvement in 2D instance labeling over baseline single-frame predictions.

II Related Work

II-A Registration of RGB-D Images

A large number of registration algorithms have been proposed in the context of RGB-D Tracking and Mapping (TAM) [1, 2, 3, 11, 12]. Feature-based approaches estimate the sensor pose by only considering informative and characteristic points known as key points [11, 12]. Alternatively, dense geometric tracking approaches, such as KinectFusion [2], typically apply an ICP [13] variant to directly register the full depth image to an online reconstructed volumetric model. The original KinectFusion algorithm uses a Truncated Signed Distance Function (TSDF) [14] for model representation and point-to-plane ICP [13] for alignment. Several alternatives to this choice of algorithms have been proposed [15, 16, 17], which are expected to perform better in regions where the point-to-plane distance is ill-defined.

Using only depth data, tracking failure can occur in situations where the amount of characteristic features in the depth map is low. Steinbrücker et al. [18] introduced an energy minimization approach for RGB-D image registration that relies on color information instead. In comparison with geometric ICP, the authors reported that their method is more accurate in the regime of small camera motions. Whelan et al. [19] combined the color and depth information in the cost function so that all given information is used. They demonstrated that this combination increases the robustness of camera tracking across a variety of environments. This idea was further used in ElasticFusion [1], which fuses measurements and uses a surfel structure instead of a volumetric one for reconstruction. ElasticFusion demonstrates the capability to produce globally consistent reconstructions in real time without the use of post-processing steps. Similarly to ElasticFusion, our approach also integrates both geometric and photometric cues for camera tracking.

II-B Semantic Mapping

Fusing semantic information along with geometry within a 3D reconstructed map is a promising approach to enable intelligent systems to better understand a 3D scene. A number of semantic mapping systems have been developed [6, 20, 21]. Hermans et al. [20] utilize Random Decision Forests to achieve semantic pixel-wise image labeling and fuse the labels in a classic Bayesian framework. Previous work by McCormac et al. [6] aimed towards a useful semantic 3D map by combining the advantages of Convolutional Neural Networks (CNNs) and ElasticFusion [1]. The correspondences between frames are estimated by the SLAM system, while their CNN architecture adopts a Deconvolutional Semantic Segmentation network [22] to generate a pixel-wise semantic map for incoming images. Unlike the original architecture [22], this system incorporates depth information to obtain a higher accuracy than the pretrained RGB network. The authors report that fusing multiple predictions leads to a significant improvement in semantic labeling, and theirs is the first real-time capable approach suitable for interactive indoor scene scanning and labeling. Likewise, SegICP-DSR [21] fuses RGB-D observations into a semantically-labeled point cloud for object pose estimation using adversarial networks and ElasticFusion. There is, however, one significant difference: SegICP-DSR employs the semantic label difference instead of a photometric error when formulating the alignment objective function. A semantically-labeled point cloud can then be output directly from the reconstruction process without an extra update step. Clearly, the addition of semantic information enables a much greater range of functionality than geometry alone. However, since the above systems only consider class labels, they are limited to scenarios with single object instances per scene, and their performance may degrade when multiple objects of the same type are present.

MaskFusion [8] is a real-time, object-aware, semantic and dynamic RGB-D SLAM system. It combines geometric segmentation running on every frame with semantic segmentation using Mask R-CNN computed for selected keyframes. The geometric segmentation algorithm acquires object boundaries based on an analysis of depth discontinuities and surface normals, while Mask R-CNN is used to provide object masks with semantic labels. Camera poses are estimated by minimizing a joint geometric and photometric error function as presented in [1]. The reported results demonstrate that while MaskFusion outperforms a set of baseline state-of-the-art algorithms in highly dynamic scenes, ElasticFusion performs best on static and moderately dynamic scenes. Unlike MaskFusion, our system does not target dynamic scenes or real-time operation. Instead, we assume all objects to be static during an observation and aim to reconstruct high-quality object models.

Most closely related to ours is Fusion++ by McCormac et al. [7], which aims to produce multiple semantically labeled maps of object instances without a dense representation of the entire static scene. Fusion++ uses Mask R-CNN instance segmentation to initialize dense per-object TSDF reconstructions with object size-dependent resolutions. For camera tracking, Fusion++ takes an approach similar to KinectFusion using projective data association and a point-to-plane error. Note that apart from object-level maps, Fusion++ also maintains a coarse background TSDF to assist frame-to-model tracking. While the authors evaluated the trajectory error of the developed system against the baseline approach of simple coarse TSDF odometry, they did not provide a comparison with other photometry- or semantics-aware state-of-the-art approaches.

Though we also aim towards multiple object-level maps, there are clear differences between Fusion++ and our work. While Fusion++ only uses depth information for motion tracking, our approach increases the robustness of sensor tracking by integrating semantic, appearance, and geometric cues into the reconstruction process. In addition, our CNN network is able to generate adaptive weights for the joint cost function, which plays a big role in robust camera pose tracking.

III Methodology

Fig. 2: Flow of the proposed framework: The segmentation network firstly yields masks and probabilities specified for each category. Then the output of the segmentation stage along with depth map and RGB frame is used for camera pose estimation. Finally, input data and semantic information are fused into the 3D map based on the transformation matrix estimated from the previous stage.
Fig. 3: CNN architecture: Extending Mask R-CNN to predict masks and class probabilities while simultaneously yielding an adaptive weight for camera tracking.

Our pipeline is composed of three main components as illustrated in Fig. 2. The input RGB-D data is processed through a semantic instance segmentation stage, followed by a frame-to-model alignment stage, and finally a model fusion stage. In the following, we summarise the key elements of our method.

Segmentation: Produce object masks with semantic labels using our CNN architecture. The developed architecture also predicts weights for the joint cost function for camera tracking.

Tracking: Estimate camera poses within the ElasticFusion pipeline using our proposed joint cost function. We combine the cost functions of geometric, photometric, and semantic estimates in a weighted sum. The adaptive weights are generated by the segmentation process above.

Fusion and Segmentation Improvement: Fuse segmentation information into the 3D map using our instance-based semantic fusion. To improve segmentation accuracy, misclassified regions are corrected by two criteria which rely on a sequence of CNN predictions.

III-A Segmentation Network

In our framework, we employ Mask R-CNN, an end-to-end CNN framework, for generating a high-quality segmentation mask for each instance. Mask R-CNN has three outputs for each candidate object: a class label, a bounding-box offset, and a mask. Its procedure consists of two stages. In the first stage, candidate object bounding boxes are proposed by a Region Proposal Network (RPN). In the second stage, classification, bounding-box regression, and mask prediction are performed in parallel on each small feature map, which is extracted by RoIPool. Note that to speed up inference and improve accuracy, the mask branch is applied to the 100 highest-scoring detection boxes after running the box prediction. The mask branch predicts a binary mask from each RoI using an FCN architecture [23]. The binary mask is a single output regardless of class, generated by binarizing the floating-number (soft) mask at a threshold of 0.5.

To integrate deep-learning-based segmentation and classification into our system, we extend Mask R-CNN to identify object outlines at the pixel level while simultaneously generating an adaptive weight used in the camera pose tracking stage, as shown in Fig. 3. A fourth branch is added to our CNN framework, which shares the computation of feature maps with the Mask R-CNN branches and outputs the weight via a fully connected layer. In general, the network consists of a backbone CNN, a region proposal network (RPN), an RoI classifier, a bounding-box regressor, a mask branch, and a camera tracking weight estimator. The CNN backbone is a standard convolutional neural network used for extracting a feature map. This convolutional feature map not only becomes the input for the other stages of Mask R-CNN but also shares computation with our extended branch for adaptive weight estimation. The developed network therefore receives an RGB image and returns a set of per-pixel class probabilities along with the weight used in the cost function in the subsequent alignment stage. The weight estimation is treated as a classification problem where the target is a binary decision: whether or not the given RGB image should be used in the registration process. In other words, we train our weight-predicting model as a binary classifier, where one class signifies that the RGB image contains useful information for the subsequent registration process, while the other indicates the converse. The probability predicted by the classification model is used as the adaptive weight in our joint cost function for camera pose estimation.
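The weight-estimation branch described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' exact architecture: the shared backbone feature map is globally pooled and passed through a fully connected layer with a sigmoid, yielding a scalar in (0, 1) used as the adaptive photometric weight. All names and sizes are ours.

```python
import numpy as np

# Illustrative weight head on top of a shared backbone feature map.
rng = np.random.default_rng(0)
W = rng.standard_normal(256) * 0.01   # fully connected weights (toy values)
b = 0.0                               # bias

def predict_weight(feature_map: np.ndarray) -> float:
    """feature_map: (H, W, 256) shared backbone features -> scalar weight."""
    pooled = feature_map.mean(axis=(0, 1))     # global average pooling -> (256,)
    logit = pooled @ W + b                     # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))        # sigmoid -> probability in (0, 1)

features = rng.standard_normal((32, 32, 256))
w_rgb = predict_weight(features)
```

In the full system this branch would be trained jointly with the Mask R-CNN losses, with the binary "useful for registration" label as supervision.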

III-B Camera Tracking

To perform camera tracking, our object-oriented mapping system maintains a fused surfel-based model of the environment (similar to the model used by ElasticFusion [1]). Here we borrow and extend the notation proposed in the original ElasticFusion paper. The model is represented by a cloud of surfels, where each surfel consists of a position p ∈ R³, a normal n ∈ R³, a colour c ∈ N³, a weight w ∈ R, a radius r ∈ R, an initialisation timestamp t₀ and a last updated timestamp t. In addition, each surfel also stores an object instance label o. Each object instance o is associated with a discrete probability distribution P(l_o = l_i) over the set of class labels l_i ∈ L.

The image space domain is defined as Ω ⊂ N², where an RGB-D frame is composed of a color map C and a depth map D over Ω. We define the 3D back-projection of a pixel u ∈ Ω given a depth map D as p(u, D) = K⁻¹ u̇ D(u), where K is the camera intrinsics matrix and u̇ the homogeneous form of u. The perspective projection of a 3D point p = [x, y, z]ᵀ is defined as u = π(Kp), where π([x, y, z]ᵀ) = (x/z, y/z)ᵀ. In the following, we describe our proposed approach for combined ICP pose estimation.
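Under a pinhole model, the back-projection and projection operators used throughout this section can be sketched as follows; the intrinsics values below are illustrative, not from the paper.

```python
import numpy as np

# Assumed pinhole intrinsics (illustrative values).
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])

def back_project(u, depth, K):
    """p(u, D) = K^-1 * u_homogeneous * D(u)."""
    u_dot = np.array([u[0], u[1], 1.0])        # homogeneous pixel coordinates
    return np.linalg.inv(K) @ u_dot * depth

def project(p, K):
    """u = pi(K p), with pi(x, y, z) = (x/z, y/z)."""
    q = K @ p
    return q[:2] / q[2]

u = (100.0, 80.0)
p = back_project(u, 1.5, K)   # 3D point at 1.5 m depth
u2 = project(p, K)            # round-trips back to the original pixel
```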

Our approach aims to estimate a sensor pose that minimizes the cost over a combination of the global point-plane energy, photometric error and semantic difference. We wish to minimize a joint optimization objective:


    E = E_icp + w_rgb · E_rgb + w_sem · E_sem        (1)

where E_icp, E_rgb, and E_sem are the geometric, photometric, and semantic error terms respectively. The photometric error term is weighted by the factor w_rgb predicted by our CNN. The weight for the semantic error is defined as w_sem = n_obj / n, where n_obj is the number of non-background pixels and n is the total number of pixels per frame.

The details of the first two terms in equation (1) can be found in [1]. E_icp is the point-to-plane error metric, in which the objective of minimization is the sum of squared distances between each point from a live surface measurement and the tangent plane at its corresponding point from the model prediction. This cost function performs well in environments with high geometric texture; however, tracking failures can occur when there are not enough features to fully constrain all 6 DOF of the camera pose. For instance, if the measured points are located on planar surfaces, the point-to-plane error metric will fail to register successive views, because there is no mechanism to guarantee that a global minimum can be reached by shifting source points to target points in the direction perpendicular to the normals. Steinbrücker et al. [18] used color information to overcome this. E_rgb is the cost over the photometric error between the current color image and the predicted model color from the last frame.
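As a toy illustration of how the three terms combine under equation (1), using scalar stand-ins for the per-pixel error sums (the function and variable names are ours, not the paper's):

```python
import numpy as np

def joint_cost(e_icp, e_rgb, e_sem, w_rgb, mask):
    """Weighted joint objective: E = E_icp + w_rgb*E_rgb + w_sem*E_sem."""
    n_obj = int(np.count_nonzero(mask))  # non-background pixels
    n = mask.size                        # total pixels in the frame
    w_sem = n_obj / n                    # semantic weight from object coverage
    return e_icp + w_rgb * e_rgb + w_sem * e_sem

mask = np.zeros((4, 4), dtype=bool)
mask[:2, :] = True                       # half the pixels belong to objects
cost = joint_cost(e_icp=1.0, e_rgb=2.0, e_sem=4.0, w_rgb=0.5, mask=mask)
# w_sem = 8/16 = 0.5, so cost = 1.0 + 0.5*2.0 + 0.5*4.0 = 4.0
```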

A key distinction between our approach and ElasticFusion is that instead of estimating the camera pose via geometric and photometric data alone, we additionally employ semantic information to perform frame-to-frame tracking. The cost we wish to minimize depends on the difference in predicted likelihood values between the label probability maps. To simplify minimizing the cost function, we only take the probability of the most likely class from each pixel-wise probability vector. We denote the values S(u) of this probability over a given image as the semantic probability map. Based on this simplification, the semantic probability error can be formulated as:


    E_sem = Σ_{u ∈ Ω} ( S_t(u) − S_{t−1}(u′) )²        (2)

Here, S_t and S_{t−1} are the semantic probability maps derived from the per-pixel class distributions of the frames at time steps t and t−1 respectively. The vector u′ is the warped pixel, defined according to the incremental transformation T:

    u′ = π( K T p(u, D_t) )        (3)


Finally, we find the transformation T by minimizing the objective (1) through the Gauss-Newton non-linear least-squares method with a three-level coarse-to-fine pyramid scheme.
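The semantic term can be sketched compactly: take the most likely class probability per pixel, then penalise the squared difference between the current map and the warped previous one. In the sketch below the warp is taken as the identity purely for brevity, and the toy inputs are ours.

```python
import numpy as np

def semantic_map(class_probs: np.ndarray) -> np.ndarray:
    """class_probs: (H, W, C) per-pixel distributions -> (H, W) map of
    the most likely class's probability."""
    return class_probs.max(axis=-1)

def semantic_error(probs_t, probs_prev):
    """Sum of squared differences between the two semantic maps
    (identity warp assumed for this sketch)."""
    s_t = semantic_map(probs_t)
    s_prev = semantic_map(probs_prev)
    return float(((s_t - s_prev) ** 2).sum())

p_t = np.full((2, 2, 3), 1 / 3)                      # uncertain current frame
p_prev = np.zeros((2, 2, 3)); p_prev[..., 0] = 1.0   # confident previous frame
err = semantic_error(p_t, p_prev)
# per pixel: (1/3 - 1)^2 = 4/9, over 4 pixels -> 16/9
```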

III-C Fusion and Segmentation Improvement

Class label fusion: Given an RGB-D frame at time step t, each mask M from Mask R-CNN must be matched to an instance in the 3D map; otherwise, it is assigned as a new instance. To find the corresponding instance, we use the tracked camera pose and the existing instances in the map built at time step t−1 to predict binary masks via splatted rendering. The percent overlap between the mask M and a predicted mask M̂_i for object instance i is computed as O_i = |M ∩ M̂_i| / |M|. The mask M is then mapped to the object instance with the largest overlap, i* = argmax_i O_i.
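The mask-to-instance association can be sketched as below. The minimum-overlap threshold that triggers creation of a new instance is an assumption for illustration; the paper does not state its value.

```python
import numpy as np

def associate(mask: np.ndarray, rendered: list, min_overlap: float = 0.3):
    """Map a detected mask to the rendered instance mask it overlaps most.
    Returns the instance index, or None to signal a new instance
    (min_overlap is an assumed, illustrative threshold)."""
    overlaps = [np.logical_and(mask, r).sum() / mask.sum() for r in rendered]
    best = int(np.argmax(overlaps))
    return best if overlaps[best] >= min_overlap else None

m = np.zeros((4, 4), dtype=bool); m[:2, :2] = True
inst0 = np.zeros((4, 4), dtype=bool); inst0[2:, 2:] = True   # no overlap
inst1 = np.zeros((4, 4), dtype=bool); inst1[:2, :] = True    # covers the mask
idx = associate(m, [inst0, inst1])
# overlaps are 0.0 and 1.0 -> mask maps to instance 1
```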

Unlike existing works [6, 7, 8], in which each element constituting the 3D map (such as a surfel or TSDF voxel) stores a probability distribution over all classes, we propose to assign an object instance label to each surfel; this label is then associated with a discrete probability distribution P(l_o = l_i) over the set of class labels l_i ∈ L. In consequence, we need only one probability vector for all surfels belonging to the same object entity. This makes a big difference when the number of surfels is much larger than the number of classes. To update the class probability distribution, a recursive Bayesian update is used in [20]. However, this scheme often results in an overly confident class probability distribution that contains scores unsuitable for ranking in object detection [7]. In order to make the distribution more even, we update the class probability by simple averaging over the n per-frame predictions p_k observed so far:

    P(l_o = l_i) = (1/n) Σ_{k=1}^{n} p_k(l_i)
Moreover, previous related works discard the background/object probability from the binary mask branch, which predicts which pixels correspond to the main classes (non-background) and which pixels correspond to the background. Conversely, we enrich the segmentation information on each surfel by adding this probability to account for background/object predictions. To that end, each surfel in our 3D map has a non-background probability attribute Φ.
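Both the per-instance class distribution and the per-surfel non-background probability can be maintained as running means of the per-frame CNN predictions. A minimal sketch, with illustrative variable names:

```python
import numpy as np

def running_mean(stored, n_obs, new_value):
    """stored is the mean of n_obs previous observations; fold in one more."""
    return (stored * n_obs + new_value) / (n_obs + 1), n_obs + 1

# Class distribution of one instance after one observation.
class_probs = np.array([0.6, 0.4])
class_probs, n = running_mean(class_probs, 1, np.array([0.2, 0.8]))
# -> mean of the two predictions: [0.4, 0.6]

# Non-background probability of one surfel after two observations.
phi, m = running_mean(0.5, 2, 0.2)
# -> (0.5*2 + 0.2) / 3 = 0.4
```

The averaging form keeps the stored distribution from saturating toward 0 or 1, which is the behaviour the text contrasts with the recursive Bayesian update.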

As presented in [9], the binary mask branch first generates a floating-number mask, which is then resized to the RoI size and binarized at a threshold of 0.5. We are therefore able to extract a per-pixel non-background probability map of the same size as the image. Given the RGB-D frame at time step t, a non-background probability φ_t(u) is assigned to each pixel. Camera tracking and the 3D back-projection introduced in Section III-B enable us to update each surfel with the corresponding probability by averaging over its n previous observations:

    Φ ← ( n Φ + φ_t(u) ) / ( n + 1 )
Segmentation Improvement: Despite the power and flexibility of Mask R-CNN, it often misclassifies object boundary regions as background; in other words, the detailed structures of an object are often lost or smoothed. Thus, there is still much room for improvement in segmentation. We observe that many of the pixels in the misclassified regions have a non-background probability only slightly smaller than 0.5, while the soft probability for real background pixels is often far below the threshold. Based on this observation, we expect to achieve a more accurate object-aware semantic scene reconstruction by considering the non-background probability of surfels across a sequence of frames. With this goal, each candidate surfel s is associated with a confidence c_s. When a surfel is identified for the first time, its associated confidence is initialized to zero. Then, when a new frame arrives, we increment the confidence only if the corresponding pixel of that surfel satisfies two criteria: (i) its non-background probability is greater than 0.4; (ii) there is at least one object pixel inside its 6-neighborhood. After a fixed number of frames, if the confidence exceeds a threshold, we assign surfel s to the closest instance; otherwise, c_s is reset to zero. We found empirically chosen values of the frame count and threshold that provide good performance.
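The two boundary-recovery criteria can be sketched on a 2D probability map. For brevity this sketch checks a 4-neighbourhood in the image plane rather than the 6-neighborhood used in the paper, and the toy inputs are ours.

```python
import numpy as np

def update_confidence(conf, prob_map, object_mask, thresh=0.4):
    """Increment per-pixel confidence where (i) the non-background
    probability exceeds thresh and (ii) at least one 4-neighbour is
    already an object pixel."""
    has_obj_neighbour = np.zeros_like(object_mask)
    has_obj_neighbour[1:, :] |= object_mask[:-1, :]   # neighbour above
    has_obj_neighbour[:-1, :] |= object_mask[1:, :]   # neighbour below
    has_obj_neighbour[:, 1:] |= object_mask[:, :-1]   # neighbour left
    has_obj_neighbour[:, :-1] |= object_mask[:, 1:]   # neighbour right
    vote = (prob_map > thresh) & has_obj_neighbour    # criteria (i) and (ii)
    return conf + vote.astype(np.int32)

conf = np.zeros((3, 3), dtype=np.int32)
prob = np.array([[0.45, 0.1, 0.1],
                 [0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.1]])
obj = np.zeros((3, 3), dtype=bool); obj[0, 1] = True
conf = update_confidence(conf, prob, obj)
# only pixel (0,0) votes: prob 0.45 > 0.4 and neighbour (0,1) is an object
```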

IV Experiments

We have evaluated our system by performing experiments on the TUM [24] and YCB video [10] datasets. These experiments are aimed at evaluating both trajectory estimation and surface reconstruction accuracy. A comparison against the most closely related works is also performed.

For all tests, we run our system on a standard desktop PC running 64-bit Ubuntu 16.04 Linux with an Intel Core i7-4770K at 3.5 GHz and an nVidia GeForce GTX 1080 Ti GPU. Our pipeline is implemented in Python with TensorFlow 1.6 for segmentation and in C++ with CUDA for mapping. The input is standard-resolution RGB-D video.

To train our CNNs, we start with weights pretrained on the ImageNet dataset [25] with a ResNet-101 [26] backbone. We fine-tune the layers of Mask R-CNN on the COCO dataset with 10 object classes common in indoor environments (backpack, chair, keyboard, laptop, monitor, computer mouse, cell phone, sink, refrigerator, microwave) and on a portion of the YCB video dataset not used in the evaluations. To train the weight estimator branch, we split the SceneNN dataset [27] into two groups based on the camera pose ground truth and the trajectory estimates of ElasticFusion using only the photometric error.

IV-A Camera Pose Tracking

We compare the trajectory estimation performance of our system to the two most closely related works, MaskFusion and Fusion++, on the widely used RGB-D benchmark of [24]. This benchmark is one of the most popular datasets for the evaluation of RGB-D SLAM systems. It covers a large variety of scenes and camera motions, and provides sequences for debugging with slow motions as well as longer trajectories with and without loop closures. Each sequence contains the color and depth images, as well as the ground-truth trajectory from a motion capture system. To evaluate the error in the estimated trajectory against the ground truth, we adopt the absolute trajectory error (ATE) root-mean-square error (RMSE) metric as proposed in [24].
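The ATE RMSE metric reduces to a simple computation once the estimated trajectory has been rigidly aligned to the ground truth; the alignment step (Horn's method in the benchmark tools) is omitted in this sketch, and the toy trajectories are ours.

```python
import numpy as np

def ate_rmse(estimated: np.ndarray, ground_truth: np.ndarray) -> float:
    """Both arrays: (N, 3) aligned camera positions per timestamp.
    Returns the root-mean-square of the translational differences."""
    diffs = estimated - ground_truth
    return float(np.sqrt((np.linalg.norm(diffs, axis=1) ** 2).mean()))

gt = np.zeros((4, 3))                 # stationary ground truth (toy data)
est = np.array([[0.00, 0.00, 0.00],
                [0.02, 0.00, 0.00],
                [0.00, 0.02, 0.00],
                [0.00, 0.00, 0.02]])  # 2 cm error on three timestamps
err = ate_rmse(est, gt)
# sqrt((0 + 3 * 0.0004) / 4) = sqrt(0.0003) ≈ 0.0173 m
```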

Sequence PCL-KinFu Kintinuous ElasticFusion MaskFusion Fusion++ Our System
freiburg1_desk 0.073 0.037 0.020 0.034 0.049 0.022
freiburg1_room 0.187 0.075 0.068 0.153 0.235 0.065
freiburg1_desk2 0.102 0.071 0.048 0.093 0.153 0.056
freiburg1_360 - 0.116 0.108 0.157 - 0.126
freiburg1_teddy - 0.132 0.083 0.129 - 0.095
freiburg2_desk 0.103 0.034 0.071 0.108 0.114 0.083
freiburg2_xyz 0.077 0.029 0.011 0.041 0.020 0.025
freiburg2_rpy - 0.018 0.015 0.076 - 0.012
freiburg3_long_office_household 0.086 0.030 0.017 0.102 0.108 0.085
freiburg3_large_cabinet - 0.144 0.099 0.133 - 0.052
TABLE I: Comparison of ATE RMSE on RGB-D SLAM benchmark. All units given are in metres.

Table I shows the results. From these we can see that the performance of our system is comparable to state-of-the-art classical approaches, and that it outperforms both MaskFusion and Fusion++. Results for Fusion++ are taken from the respective publication as presented by the authors, and values for MaskFusion are computed using the MaskFusion implementation. While the original ElasticFusion algorithm still obtains the best overall ATE performance, the results of our approach are comparable. Despite this relative similarity in average trajectory accuracy, our approach performs better in reconstructing the relevant object-scale detail, as discussed in the next subsection.

IV-B Reconstruction

Fig. 4: Heat maps showing reconstruction error of ElasticFusion (EF), MaskFusion (MF), and our proposed system on the remaining test videos from the YCB video dataset (0007, 0036, 0072) and three video sequences (01, 02, 03) we acquired independently in scenes featuring a larger number of objects and more complex camera trajectories.
ElasticFusion MaskFusion Our System
YCB video 0007 9.6 7.3 6.5
YCB video 0036 8.1 6.4 5.7
YCB video 0072 10.1 9.4 8.7
Our sequence 01 7.1 6.7 3.7
Our sequence 02 7.3 6.6 4.1
Our sequence 03 7.5 6.2 3.4
TABLE II: Comparison of surface reconstruction accuracy results on the YCB objects (mm).

It should be noted that a good performance on a camera trajectory benchmark does not always imply a high-quality surface reconstruction. We have evaluated our system by performing experiments on the Yale-CMU-Berkeley (YCB) Object and Model set [28]. We fine-tuned our network on the training set of the YCB video dataset, which contains 92 real video sequences of 21 object instances; 89 videos along with 80,000 synthetic images are used for training. We evaluate on the remaining test videos from the original dataset, as well as on three video sequences we acquired independently in scenes featuring a larger number of objects and more complex camera trajectories.

In order to evaluate surface reconstruction quality, we compare the object models obtained through our approach to the ground-truth YCB object models. Note that the ground-truth object models are only used here to compute evaluation metrics, unlike in prior works such as SLAM++ which use them within the tracking framework. For every object present in the scene, we first register the reconstructed model M to the ground-truth model G. Next, we project every vertex from M onto G and compute the distance between the original vertex and its projection. Finally, we calculate and report the mean distance over all model points and all objects.
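A simplified version of this metric replaces the projection onto the ground-truth surface with a brute-force nearest-neighbour search over the ground-truth vertices; registration is assumed to have been done already, and the toy models are ours.

```python
import numpy as np

def mean_surface_error(M: np.ndarray, G: np.ndarray) -> float:
    """M: (n, 3) reconstructed vertices, G: (m, 3) ground-truth vertices.
    Mean distance from each vertex of M to its closest vertex in G."""
    d = np.linalg.norm(M[:, None, :] - G[None, :, :], axis=2)  # (n, m) distances
    return float(d.min(axis=1).mean())                         # closest point per vertex

G = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
M = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.0]])
err = mean_surface_error(M, G)
# per-vertex distances: 0.1 and 0.0 -> mean 0.05
```

For dense models a k-d tree (e.g. `scipy.spatial.cKDTree`) would replace the quadratic pairwise distance matrix.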

Table II and Fig. 4 show the mean reconstruction error over the six sequences produced by our system, MaskFusion, and ElasticFusion. Our method consistently yields the lowest reconstruction errors over all datasets. From this comparison it is evident that our approach benefits greatly from the use of the proposed joint cost function with adaptive weights. Interestingly, we observe that accuracy increases when more segmented objects appear in the reconstructed environment, suggesting that our framework makes efficient use of the available semantic information to improve surface reconstruction quality. In other words, when the number of objects of interest increases, the semantic probability map becomes more textured, which leads to better reconstruction performance. We also note that in our approach all surfels on objects of interest are always active, whereas ElasticFusion segments surfels into an inactive area if they have not been observed for a certain period of time. This means that object surfels are updated all the time. As a result, our framework is able to produce a highly accurate object-oriented semantic map.

IV-C Segmentation Accuracy Evaluation

In this section, we show on the YCB video dataset that our system leads to an improvement in 2D instance labeling over the baseline single-frame predictions generated by Mask R-CNN. Our 2D masks are obtained by reprojecting the reconstructed 3D model. We use the Intersection over Union (IoU) metric for this evaluation, which measures the number of pixels common to the ground-truth and prediction masks divided by the total number of pixels present across both masks. The results of this evaluation are shown in Fig. 5. We observe that the segmentation performance improves, on average, from 63.5% for a single frame to 83.4% when projecting the predictions from the 3D map.
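The IoU metric as defined above is a one-liner on binary masks; the toy masks below are for illustration.

```python
import numpy as np

def iou(gt: np.ndarray, pred: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    return float(inter / union) if union else 1.0

gt = np.zeros((4, 4), dtype=bool); gt[:2, :] = True      # 8 ground-truth pixels
pred = np.zeros((4, 4), dtype=bool); pred[:3, :] = True  # 12 predicted pixels
score = iou(gt, pred)
# intersection 8, union 12 -> 2/3
```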

Fig. 5: Results of segmentation accuracy evaluation on YCB videos.

IV-D Run-time Performance and Memory Usage

Fig. 6: Memory usage for storing class probabilities.

Run-time Performance: Our current system does not run in real time due to the heavy computation of instance segmentation. Our CNN requires 350 ms per frame, while camera pose estimation, fusion, and segmentation improvement require a further 70 ms on a typical sequence of the RGB-D SLAM benchmark [24].

Memory Usage: We compared our mask-based fusion method with other approaches [6, 7, 8], which assign class probabilities to each element of the 3D map rather than to each mask. As shown in Fig. 6, the memory usage of the proposed method is significantly reduced compared to the conventional approaches over all samples. On average, the proposed method uses 5.7% of the memory of the conventional approaches.
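A back-of-the-envelope comparison of the two storage schemes makes the saving intuitive. The counts and byte sizes below are illustrative assumptions, not the paper's figures: storing a float32 class distribution on every surfel versus one distribution per instance plus a small integer instance id per surfel.

```python
# Illustrative sizes (assumed, not from the paper).
n_surfels = 1_000_000
n_instances = 20
n_classes = 10
bytes_per_prob = 4       # float32 probability entry
bytes_per_label = 2      # uint16 instance id

# Conventional: one class distribution per map element.
per_element = n_surfels * n_classes * bytes_per_prob

# Proposed: one distribution per instance, plus a label per surfel.
per_instance = (n_instances * n_classes * bytes_per_prob
                + n_surfels * bytes_per_label)

ratio = per_instance / per_element
# 2,000,800 / 40,000,000 ≈ 0.05 -> roughly 5% of the conventional memory
```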

V Conclusions

In this paper we have presented a 3D mapping system for RGB-D camera pose tracking that yields high-quality object-oriented semantic reconstruction. Our system is based on incorporating state-of-the-art RGB-D SLAM and deep-learning-based instance segmentation and classification. Our main contribution is to show that by combining geometric, appearance, and semantic cues in camera pose tracking we are able to obtain reliable camera tracking and state-of-the-art surface reconstruction in small-scale environments populated with objects of interest. In addition, we propose an approach to improve segmentation accuracy and reduce the memory required for storing class probability distributions. We have provided an extensive evaluation on common benchmarks and our own dataset. The results confirm that the developed system performs strongly in terms of sensor pose estimation, surface reconstruction, and segmentation in comparison to other state-of-the-art systems. As future work, we plan on incorporating depth images in the Mask R-CNN pipeline and on reducing the runtime requirements of the proposed system. Devising methods for automatic tuning of the ICP and semantic weights in our camera tracking objective function is also a promising direction for further investigation.


  • [1] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, “Elasticfusion: Real-time dense slam and light source estimation,” The International Journal of Robotics Research, vol. 35, no. 14, pp. 1697–1716, 2016.
  • [2] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on.   IEEE, 2011, pp. 127–136.
  • [3] D. R. Canelhas, T. Stoyanov, and A. J. Lilienthal, “Sdf tracker: A parallel algorithm for on-line pose estimation and scene reconstruction from depth images,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on.   IEEE, 2013, pp. 3671–3676.
  • [4] Y. Nakajima and H. Saito, “Efficient object-oriented semantic mapping with object detector,” IEEE Access, vol. 7, pp. 3206–3213, 2019.
  • [5] N. Sünderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful maps with object-oriented semantic mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5079–5085.
  • [6] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “SemanticFusion: Dense 3D semantic mapping with convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 4628–4635.
  • [7] J. McCormac, R. Clark, M. Bloesch, A. J. Davison, and S. Leutenegger, “Fusion++: Volumetric object-level SLAM,” arXiv preprint arXiv:1808.08378, 2018.
  • [8] M. Rünz and L. Agapito, “MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects,” arXiv preprint arXiv:1804.09194, 2018.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
  • [10] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes,” arXiv preprint arXiv:1711.00199, 2017.
  • [11] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, “Visual odometry and mapping for autonomous flight using an RGB-D camera,” in Robotics Research. Springer, 2017, pp. 235–252.
  • [12] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, “An evaluation of the RGB-D SLAM system,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1691–1696.
  • [13] Y. Chen and G. Medioni, “Object modelling by registration of multiple range images,” Image and vision computing, vol. 10, no. 3, pp. 145–155, 1992.
  • [14] B. Curless and M. Levoy, “A volumetric method for building complex models from range images,” in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques.   ACM, 1996, pp. 303–312.
  • [15] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald, “Kintinuous: Spatially extended KinectFusion,” 2012.
  • [16] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald, “Real-time large-scale dense RGB-D SLAM with volumetric fusion,” The International Journal of Robotics Research, vol. 34, no. 4-5, pp. 598–626, 2015.
  • [17] F. I. I. Muñoz and A. I. Comport, “Point-to-hyperplane RGB-D pose estimation: Fusing photometric and geometric measurements,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 24–29.
  • [18] F. Steinbrücker, J. Sturm, and D. Cremers, “Real-time visual odometry from dense RGB-D images,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 719–722.
  • [19] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald, “Robust real-time visual odometry for dense RGB-D mapping,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 5724–5731.
  • [20] A. Hermans, G. Floros, and B. Leibe, “Dense 3D semantic mapping of indoor scenes from RGB-D images,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 2631–2638.
  • [21] J. M. Wong, S. Wagner, C. Lawson, V. Kee, M. Hebert, J. Rooney, G.-L. Mariottini, R. Russell, A. Schneider, R. Chipalkatty, et al., “SegICP-DSR: Dense semantic scene reconstruction and registration,” arXiv preprint arXiv:1711.02216, 2017.
  • [22] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
  • [23] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [24] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 573–580.
  • [25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [27] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung, “SceneNN: A scene meshes dataset with annotations,” in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 92–101.
  • [28] B. Calli, A. Walsman, A. Singh, S. Srinivasa, P. Abbeel, and A. M. Dollar, “Benchmarking in manipulation research: The YCB object and model set and benchmarking protocols,” arXiv preprint arXiv:1502.03143, 2015.