As the cameras carried by robots or other smart devices move through scenes, their ability to operate and interact intelligently and persistently with their environments will depend on the quality of scene representation which they can build and maintain [Davison:ARXIV2018, Cadena:etal:TRO2016]. Convolutional Neural Networks (CNNs) have proven highly effective at semantic labelling and have been mainly used in two key approaches for scene labelling, each with its own set of advantages.
View-based labelling of raw input image data and incremental fusion of generated labels into the scene representation: each camera frame is labelled and used together with the geometric correspondence available from a SLAM system to accumulate label estimates in the map. Up-to-date incremental map labels are thus available at frame rate with low latency. As each scene element is seen from multiple varying viewpoints, the independent image labellings contribute to robust fusion and the incremental correction of errors. The labelling CNN runs on “raw” data obtained directly from the camera, and can run at the sensor’s natural resolution. Considering label correlations is computationally intractable and the individual pixel labels are thus generally assumed to be independent. Pixel-wise label fusion of every frame or keyframe during reconstruction offers a scalable solution to adding semantics to a variably sized scene, but can require unnecessary computation in areas which are easily segmented from one view.
One-off map-based labelling of the generated scene model: a single labelling network such as a volumetric 3D CNN is applied to the whole reconstruction produced by a SLAM system, which contains both geometry and appearance information. This approach avoids the redundant work of labelling many overlapping input frames, and can be maximally efficient by operating on each scene element only once. Map-based methods can furthermore take advantage of global context over the whole scene when labelling each part and avoid the need for element-wise label fusion approaches which generally neglect label correlations. Finally, a CNN which learns and labels in canonical map space has much less scene variation to deal with due to rotation and scale changes. Its power will, however, be limited by the quality of the map reconstruction.
Semantic SLAM systems so far employ one or the other of the above approaches and sometimes attempt to combine both (see Section II). To our knowledge, the choice of approach is rarely based on quantitatively determined advantages for the particular use-case. We aim to establish a baseline for comparison of view-based and map-based approaches to open up systematic and quantitative research on their use within real-time SLAM. We have chosen table-top dense semantic object labelling as our experimental domain, using Height Map Fusion [Zienkiewicz:etal:3DV2016] as our mapping approach. This deliberate choice of a system setup is more straightforward than general 3D geometry fusion, where there are many possible choices for 3D shape representation (e.g. point cloud, mesh, volumetric) and their corresponding network architectures for semantic labelling. Fused height maps can be labelled using a standard 2D CNN, allowing us to use networks with the same architecture for both view-based and map-based labelling, ensuring fairness in comparison with the aim that our results are transferable in their interpretation beyond the choices of a specific system. Our experiments are carried out on custom synthetic RGB-D data rendered with the SceneNet RGB-D methodology [McCormac:etal:ICCV2017]. We provide results which compare both labelling quality and computational load for view-based and map-based approaches and present the advantages of each, offering the potential for the design of optimal hybrid methods in the future.
II Related Work
Preliminary work exists towards joint geometric and semantic SLAM (e.g. [zhi2019scenecode]); yet, these systems are fairly limited in terms of accuracy and scaling. Instead, the majority of state-of-the-art work relies on a sequential geometric reconstruction and frame-wise labelling, followed by semantic fusion. A few other approaches infer labels directly from the geometric 3D reconstruction and some combine both frame-wise and map-based methods. An overview of important work in each area is given below.
II-A Incremental View-Based Labelling and Fusion
use a Recurrent Neural Network (RNN) to perform frame-wise segmentation of a sequence of RGB-D frames obtained from KinectFusion [Newcombe:etal:ISMAR2011]. Recently, approaches such as PanopticFusion [narita2019panopticfusion] and [grinvald2019volumetric] have extended incremental semantic fusion to object discovery and instance-aware maps. Our view-based method uses a CNN to segment the RGB-D frames obtained in real time from our SLAM system. Similarly to [McCormac:etal:ICRA2017], we implement incremental fusion as a Bayesian update scheme, but tailor it to our reconstruction as described in Section III-A.
II-B Map-Based Labelling
Labelling the 3D representation of a scene generally involves label inference from dense 3D representations such as TSDFs or voxel grids, which are known to be expensive to process. Hence, approaches for labelling 3D representations have mainly been put forward for single objects [Maturana:Schererl:IROS2015, Wu:etal:CVPR2015]. Recently, several works have focused on representing and processing 3D data more efficiently [WurmH:etal:IROS2011, riegler2017octnetfusion, wang2017cnn, yu2015multi]. Although most approaches focus on voxel grids, some approaches for classification of point clouds have been explored [Qi:etal:CVPR2017]. Efficient 3D labelling at large scale, however, remains unsolved and only a few have ventured into labelling a variably-sized reconstructed 3D scene. Landrieu et al. extended the idea of superpixels to 3D point clouds and proposed a superpoint method to label large scale LIDAR scans [landrieu2018large]. Dai et al. [Dai:etal:CVPR2018] proposed a fully convolutional, autoregressive, hierarchical coarse-to-fine 3D network to produce semantic labels together with geometry completion for a large 3D voxel grid scene. However, due to the expensive 3D nature of their input, the different levels of hierarchy in their network have to be trained separately. Roddick et al. [roddick2018orthographic] project image features into an orthographic 3D space using a learned transformation, which removes the scale inconsistency, and creates a feature map with meaningful distances and without projective distortions of object appearance. They improve the efficiency of object detection from the orthographic map by collapsing voxel features along the vertical axis and then process the entire map of at a grid resolution of at once. However, they do not address scalability in their method. In the map-based labelling approach applied in this paper, as in the methods above, we directly label the reconstruction. We employ a sliding window method dependent on our network’s receptive field (see Section III-B) that allows scaling to arbitrarily large maps by avoiding potential GPU memory limitations when processing the entire map in a single forward pass.
II-C Hybrid Methods
Several methods employ a combination of map-based and view-based segmentation. Finman et al. [finman2014efficient] incrementally segment a 3D point cloud using graph-based segmentation. They create small sub-parts of the map by joining the segmented parts with the border points of the existing map which they segment and fuse into the rest of the semantic map. Methods [McCormac:etal:ICRA2017] and [narita2019panopticfusion] combine view-based incremental labelling with a map-based smoothing step using a 3D Conditional Random Field. Vineet et al. [vineet2015incremental] model their voxel grid with a volumetric, densely connected, pairwise Conditional Random Field and obtain semantic labels by first evaluating unary potentials in the image domain and then projecting them back onto the voxel grid. Dai et al. [dai20183dmv] project 2D features extracted from multiple views onto a 3D voxel map, and use both geometry and projected features to predict per-voxel labels.
The scenario for our experimental comparison is table-top reconstruction and semantic labelling of a scene containing scattered objects, selected from a number of ShapeNet categories [Shapenet:ARXIV2015], as a depth camera browses the scene in an ad-hoc way. Since our focus is on a fundamental comparison of view-based and map-based labelling, we choose a height map representation for our scenes whose nature allows us to use the same CNN network architecture designed for RGB-D input for both labelling methods. We use Height Map Fusion [Zienkiewicz:etal:3DV2016] as our scene reconstruction backend. For our experiments, we opt for a synthetic environment based on rendered RGB-D data using the methodology from SceneNet RGB-D [McCormac:etal:ICCV2017]. A key reason for this decision is the need for a wide variety of scene configurations with semantic label ground truth in order to train high-performing view-based and map-based semantic segmentation networks. Furthermore, synthetic data gives us a high level of control over multiple experimental factors, such as the variety of viewpoints, noise, and ground truth for RGB, depth and camera poses.
III-A Incremental Label Fusion
In the incremental fusion part of our study, we build on the real-time, multi-scale height map fusion system of Zienkiewicz et al. [Zienkiewicz:etal:3DV2016], by augmenting it with a semantic fusion capability. In [Zienkiewicz:etal:3DV2016], the height map of a scene is modelled using a triangular mesh whose vertices have horizontal coordinates on a regular grid and associated variable heights which are estimated from incremental fusion. This system uses ORB-SLAM [Mur-Artal:etal:TRO2015] as a camera tracker, and geometry measurements can come from either a depth camera or incremental motion stereo in the pure monocular case. In our setup, which is based on synthetic data, we experiment with both ground truth as well as noisy camera poses and depth maps.
To add a semantic label fusion capability to the system, we associate a discrete distribution of semantic classes with every vertex of the mesh, and refine this distribution iteratively by projecting view-based semantic predictions onto the mesh in a per-surface-element-independent manner as in [McCormac:etal:ICRA2017]. For every vertex, only the pixels projected onto the adjacent faces contribute to the Bayesian update. We seek to compute the posterior distribution $P(c_v \mid M_{1:t})$ over semantic classes $c_v$ for a certain vertex $v$, given projected measurements $M_{1:t}$ on adjacent faces for all timesteps $1, \dots, t$. We define the measurement $m_{u,t}$ as the network’s prediction for a single pixel $u$ given the image $I_t$. We apply Bayes' rule to $P(c_v \mid M_{1:t})$ as follows:

$$P(c_v \mid M_{1:t}) = \frac{P(M_t \mid c_v, M_{1:t-1})\, P(c_v \mid M_{1:t-1})}{P(M_t \mid M_{1:t-1})} \tag{1}$$

We assume conditional independence of the measurements given the vertex class, i.e. $P(M_t \mid c_v, M_{1:t-1}) = P(M_t \mid c_v)$, and can thus rewrite the above as:

$$P(c_v \mid M_{1:t}) \propto P(M_t \mid c_v)\, P(c_v \mid M_{1:t-1}) \tag{2}$$

describing the relation between posterior $P(c_v \mid M_{1:t})$, measurement likelihood $P(M_t \mid c_v)$, and a-priori distribution $P(c_v \mid M_{1:t-1})$. Note that we dropped the normalisation constant during the derivation and thus must normalise the posterior after evaluation.

For computational reasons, we also assume spatial independence of the measurements and thus factorise the measurement likelihood into:

$$P(M_t \mid c_v) = \prod_{u \in U_v} P(m_{u,t} \mid c_v) \tag{3}$$

where $U_v$ denotes the set of pixels whose rays intersect with the surfaces adjacent to the given vertex and $P(m_{u,t} \mid c_v)$ is the measurement likelihood at pixel $u$ and time $t$ given the vertex class $c_v$. Since the projected measurement locations do not coincide with the location of the vertex, we model the measurement likelihood using a distance-based decay, in which $d$ is the Euclidean distance between the vertex and the projected pixel, $\alpha$ is a tuning parameter defining the decay rate, and two scaling factors based on the total number of semantic classes ensure that the likelihood models a uniform distribution as the distance grows large.

Intuitively, the closer the projected pixel is to the vertex, the more likely the pixel class is to coincide with the vertex class. The likelihood of a measurement being of any class other than the observed one is distributed uniformly.

Finally, the output of our network is not directly a measured class, but rather a distribution over possible classes. This can be dealt with according to Bayes by evaluating a weighted average over classes:

$$\tilde{P}(m_{u,t} \mid c_v) = \sum_{i} P(m_{u,t} = c_i \mid c_v)\, P(m_{u,t} = c_i \mid I_t)$$

where $\tilde{P}(m_{u,t} \mid c_v)$ replaces the measurement function for evaluating $P(M_t \mid c_v)$ in Equation (3).
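As a rough illustration of the fusion update described in this section, the following Python sketch fuses per-pixel network outputs into the class distribution of a single vertex. The exponential form of the decay, the scaling that lets the match probability fall from 1 at zero distance to 1/N at large distance, and all function and parameter names are assumptions made for this example; they are consistent with the description above but are not taken from the system's implementation.

```python
import math


def soft_likelihood(net_probs, dist, alpha):
    """Likelihood of one pixel's network output for every candidate vertex class.

    Assumed decay: the probability that the observed class matches the vertex
    class is 1 at dist = 0 and decays exponentially towards the uniform value 1/N.
    """
    n = len(net_probs)
    uniform = 1.0 / n
    p_match = uniform + (1.0 - uniform) * math.exp(-alpha * dist)
    p_other = (1.0 - p_match) / (n - 1)  # remaining mass spread uniformly
    # Weighted average over the classes the network could have "measured".
    return [q * p_match + (1.0 - q) * p_other for q in net_probs]


def fuse_vertex(prior, pixel_measurements, alpha):
    """One Bayesian update of a vertex's class distribution.

    pixel_measurements: list of (net_probs, dist) pairs for the pixels that
    project onto faces adjacent to the vertex.
    """
    posterior = list(prior)
    for net_probs, dist in pixel_measurements:
        likelihood = soft_likelihood(net_probs, dist, alpha)
        posterior = [p * l for p, l in zip(posterior, likelihood)]
    total = sum(posterior)  # normalise, since the constant was dropped
    return [p / total for p in posterior]


prior = [1.0 / 3.0] * 3
measurements = [([0.8, 0.1, 0.1], 0.1), ([0.7, 0.2, 0.1], 0.2)]
posterior = fuse_vertex(prior, measurements, alpha=1.0)
```

In this sketch, pixels that project far from the vertex contribute an almost uniform likelihood and therefore leave the vertex distribution nearly unchanged, which matches the intuition given above.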
III-B Map-Based Labelling
We model map-based labelling as a one-off segmentation of the entire reconstruction. In principle, labelling the scene directly could be implemented via a single pass through a very large CNN, but here, we propose to sequentially crop and segment parts of the map using a sliding window approach. Our choice makes our method scalable, as the sliding window can be applied to an arbitrarily sized height map without memory restrictions. While a naive sliding window approach of tiling the map and processing each sub-height-map would result in a loss of context in the border regions of each crop, we ensure correct segmentation by choosing the sliding window offset based on the theoretical receptive field of the network:
where is the dimension of the network input in the sliding direction (width or height). Note that the sliding window offset is conservatively chosen and could practically be increased by considering the network’s effective receptive field [luo2016understanding]. With this method (see Figure 2), we ensure the same context for every pixel as would be obtained during a single forward pass of the entire map through our network, while avoiding GPU memory limitations.
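The overlap idea can be sketched in one dimension as follows. This is a minimal example under stated assumptions: the window offset is taken to be the window size minus the receptive field overlap (`window - 2 * (rf // 2)` for an odd receptive field `rf`), and a box filter stands in for the network, since each of its outputs depends on exactly `rf` neighbouring inputs. The function names are hypothetical.

```python
def box_filter(signal, rf):
    """Stand-in for a network: each output sums rf neighbouring inputs (zero padded)."""
    half = rf // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(padded[i:i + rf]) for i in range(len(signal))]


def sliding_window_apply(signal, window, rf, net):
    """Process a long signal in fixed-size crops without losing border context."""
    half = rf // 2
    stride = window - 2 * half  # assumed offset between consecutive crops
    if stride <= 0:
        raise ValueError("window must be larger than the receptive field overlap")
    n = len(signal)
    # Pad so that every crop has full length; zeros match the full-pass padding.
    padded = [0.0] * half + list(signal) + [0.0] * (half + window)
    out = []
    for start in range(0, n, stride):
        crop = padded[start:start + window]
        labels = net(crop, rf)
        # Keep only outputs whose receptive field lies fully inside the crop.
        out.extend(labels[half:half + stride])
    return out[:n]


signal = [float(v) for v in range(1, 24)]
tiled = sliding_window_apply(signal, window=8, rf=3, net=box_filter)
full = box_filter(signal, 3)
```

Here `tiled` and `full` agree element by element, which illustrates why an offset derived from the receptive field gives every element the same context as a single forward pass. The same scheme applies independently along the width and height of a height map.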
III-C Scene Dataset Generation
We created 647 scenes composed of objects selected from the ShapeNet taxonomy [chang2015shapenet] of the object categories "computer keyboard, keypad", "remote control, remote" and "airplane, aeroplane, plane", which are suitable for a height map representation as they typically have little overhang. For each category, we select instances, split into training, validation and test data with fractions and respectively. We chose SceneNet RGB-D [McCormac:etal:ICCV2017] for its photorealistic rendering and adapted its rendering engine to render height maps by simulating an orthogonal camera using the OptiX Raytracing engine. We generate random scenes by selecting and placing instances from each object category, sampling from a uniform distribution for object ID, position and orientation. We generate random backgrounds to increase variability in appearance. We then create data to train our view-based and map-based labelling approaches. For the former, we extract scene views () at random camera locations with RGB, depth and semantic ground truth. For the latter, we use height map fusion with ground truth camera poses to reconstruct height maps () for which we obtain semantic ground truth by rendering the same scene in SceneNet RGB-D [McCormac:etal:ICCV2017]. We then extract map samples of () at random locations. Note that while we vary the camera height between and and extract views with large angle variance (), the rendered height maps exhibit canonical scale and orientation. We generate training and validation scenes from which we use views and height map crops to train our view-based and map-based networks respectively. From our test scenes, we use views and crops to evaluate network performance and use entire scenes in our evaluation pipeline (see Section III-E) to compare both methods during scene reconstruction. An example of our synthetic dataset is shown in Figure 3.
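The random placement step can be pictured with a short Python sketch. It only illustrates uniform sampling of object ID, position and orientation; the square table extent, the restriction to a yaw rotation, the dictionary layout and all names are assumptions for this example and do not describe the actual rendering pipeline.

```python
import math
import random


def sample_scene(instances_by_category, objects_per_category, table_size, rng):
    """Draw object ID, position and orientation uniformly for each placed object."""
    scene = []
    for category, instance_ids in instances_by_category.items():
        for _ in range(objects_per_category):
            scene.append({
                "category": category,
                "instance": rng.choice(instance_ids),      # uniform over object IDs
                "x": rng.uniform(0.0, table_size),         # uniform position on the table
                "y": rng.uniform(0.0, table_size),
                "yaw": rng.uniform(0.0, 2.0 * math.pi),    # uniform in-plane orientation
            })
    return scene


rng = random.Random(0)  # fixed seed so the example is repeatable
catalogue = {"keyboard": ["k1", "k2"], "remote": ["r1"]}  # hypothetical instance IDs
scene = sample_scene(catalogue, objects_per_category=3, table_size=1.5, rng=rng)
```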
III-D Network Architecture and Training
We use a fully convolutional network based on the FuseNet architecture [hazirbas2016fusenet], which uses two parallel encoder branches (one for RGB, one for depth), to predict semantic labels for every pixel (see Figure 4). We experimented with different network architectures to obtain the best performing network with a minimum number of parameters. The best performing model was trained with a batch size of , a drop-out rate of and a learning rate of with exponential decay of base every steps. All our models were developed with TensorFlow using the Adam optimizer [kingma2014adam].
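For readers unfamiliar with this kind of schedule, a stepwise exponential learning-rate decay can be written as below. This is a generic sketch, not the training code: the staircase behaviour is an assumption, and the numeric values in the usage line are placeholders rather than the hyperparameters used in the paper.

```python
def decayed_learning_rate(initial_lr, decay_base, decay_steps, step):
    """Multiply the learning rate by decay_base once every decay_steps steps."""
    return initial_lr * decay_base ** (step // decay_steps)


# Placeholder values for illustration only.
schedule = [decayed_learning_rate(0.1, 0.5, 100, s) for s in (0, 99, 100, 250)]
```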
III-E Evaluation and Comparison
We evaluate and compare both approaches using our extension of Semantic Height Map Fusion [zienkiewicz2016monocular]. We generate test sequences from our test scenes which have camera locations relative to the scene, randomly sampled from a range which stochastically achieves full scene coverage. For each test sequence, we deploy the view-based network during the reconstruction of a synthetic scene and use our Bayesian Fusion update scheme to obtain a semantically labelled height map. We save the reconstruction at regular intervals where it is labelled by the map-based network. The semantic scene segmentation obtained by each approach is compared against the scene’s ground truth using the mean Intersection over Union () over all classes.
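The comparison metric can be sketched as follows for flattened label maps. Skipping classes that appear in neither the prediction nor the ground truth is an assumption of this example, and the function name is hypothetical.

```python
def mean_iou(pred, truth, num_classes):
    """Mean Intersection over Union across classes for two flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:  # assumed: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Class 0: intersection 1, union 2. Class 1: intersection 2, union 3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```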
IV Experiments and Results
IV-A Training Results
Using our best architecture (8 convolutional layers and skip layers at every downsampling step), we achieved a mean IoU of and for the view-based and map-based tasks respectively (see Table I). We suggest that this discrepancy in accuracy, occurring despite identical architecture and training sample number, is due to the different characteristics of the data seen by the networks, arising from the nature of their tasks and the distributions of their views. Compared to the view-based task, the map-based task is easier to learn since all map crops are taken from a canonical top-down orientation of the camera. The lower variability can possibly also lead to stronger overfitting on the training data and could explain the lower performance of our map-based method on the test data. Qualitative results for both networks can be seen in Figure 5.
| Method | mIoU | surface | remote control | keyboard | model plane |
|---|---|---|---|---|---|
| Vb | 0.95 % | 0.99 % | 0.89 % | 0.74 % | 0.51 % |
| Mb | 0.93 % | 0.99 % | 0.68 % | 0.91 % | 0.78 % |
IV-B Evaluation of the View-Based Method
We evaluate our view-based method on our test scenes by generating a semantic height map using our semantic fusion algorithm as described in Section III-E. We evaluate every 100 frames to track the increasing semantic accuracy of the reconstructed scene. On average, our view-based method reaches a value of mean IoU after frames. Figure 6 (left) displays our results for all test scenes, together with the average performance. As illustrated by the semantic reconstruction example at different coverage levels (number of seen frames) in Figure 1, the rapid early improvement in label accuracy is obtained from increasing coverage of the map, though the IoU continues to improve slowly due to fusion once full coverage has been achieved at around 400 frames. This is not surprising given the high labelling performance of the network on single images. With a less accurate network, we would expect a larger gain in accuracy from incremental fusion. We tested the influence of the alpha parameter in Equation 5 on our semantic reconstruction. It did not affect the reconstruction strongly in this setting and we chose a value of for our experiments. We further performed an analysis of computational time for our view-based method, presented in Table II.
| Step | Time |
|---|---|
| Data loading | 6.13 ms |
| Semantic segmentation (1 forward pass) | 77.00 ms |
| Semantic fusion | 31.05 ms |
IV-C Evaluation of the Map-Based Method
We evaluate our map-based method using the same test sequences. We reconstruct the scene geometry using a variable number of frames (up to ) and segment the reconstructed height maps using the sliding window method with a shift set by the theoretical receptive field of of our network. Unlike the view-based method, we start the map-based evaluation once full coverage of the scene has been reached, to avoid segmenting incomplete image patches which lie outside of the network’s learned distribution. Figure 6 (right) shows our results over all scenes. Our map-based method achieves an average mean IoU of . We measure the average time to segment one reconstructed map on a single GPU as .
IV-D Both Approaches in Comparison
We compare the mean IoU achieved on average by both methods on our test scenes during reconstruction, evaluating at every and frames for the view-based and for the map-based method respectively, demonstrating each method’s improvement w.r.t. the reconstruction state. We also compare both methods with regard to processing time to evaluate their efficiency. The results are shown in Figure 7. For the view-based method, the computation time of the reconstruction, frame-based semantic segmentation and semantic fusion is measured, and for the map-based approach, we measure the computation time for the reconstruction and the one-off scene labelling. Note that the overall processing time of our map-based method could be reduced further if we segmented the map only once, after reconstructing the full scene. Our results show that on average, with much less computation time overall, the map-based method achieves a segmentation accuracy superior to the view-based method. However, after full coverage has been reached, we observe a region in which the view-based method achieves higher labelling accuracy than the map-based method (see Figure 7 on the right). This demonstrates that for the map-based method to work well, a certain level of reconstruction accuracy has to be reached. We further observe that the map-based method performs better at the contours of objects than the view-based method. This is visualised in the error maps of the reconstruction example (Figure 1).
IV-E Comparison with Reduced Map Reconstruction Quality
We experiment with degrading the map quality to evaluate how the labelling quality of the compared methods decreases w.r.t. noise. In a first study, we apply normally distributed pose disturbances during reconstruction. Our results (Figure 8) show that the map-based method is much less robust to pose noise than the view-based method; the robustness of the latter can be attributed to the Bayesian filtering of multiple views. We then apply depth noise drawn from a normal distribution, which has a stronger negative effect on the view-based method, most likely because the latter now has to deal with reduced quality in 2D labels as well as projection errors. Plotting both methods against pose and depth noise (Figure 9) shows that while the view-based method is robust to pose noise, it quickly degrades to mean IoU with increasing depth noise. On the other hand, the map-based method is sensitive to both noise types, but to a lesser degree, degrading to only mean IoU in the tested noise range.
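The two kinds of disturbance can be illustrated with a small Python sketch. It assumes zero-mean Gaussian noise added independently to each translation component of a camera pose and to each depth value; whether rotations were also perturbed, and the noise magnitudes used, are not stated here, so the function names and standard deviations below are purely illustrative.

```python
import random


def perturb_translation(translation, sigma, rng):
    """Add zero-mean Gaussian noise to each component of a camera translation."""
    return tuple(t + rng.gauss(0.0, sigma) for t in translation)


def perturb_depth(depth_map, sigma, rng):
    """Add zero-mean Gaussian noise to every value of a depth map (list of rows)."""
    return [[d + rng.gauss(0.0, sigma) for d in row] for row in depth_map]


rng = random.Random(0)  # fixed seed so the example is repeatable
noisy_pose = perturb_translation((0.2, 0.1, 0.6), sigma=0.01, rng=rng)
noisy_depth = perturb_depth([[0.60, 0.61], [0.59, 0.60]], sigma=0.005, rng=rng)
```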
Our results show that for a setting which assumes perfect poses and depth, the map-based approach achieves higher labelling accuracy, and does so with less computation. We argue that although the view-based network achieves a high labelling accuracy on individual frames, its deployment during the early reconstruction phase results in more errors, especially in border regions of objects, from which it cannot always recover easily. The overhead of repeated forward passes and multi-view fusion is a further disadvantage of this approach. Our experiments on pose and depth noise show that for the studied noise range, the map-based approach, although affected more equally by both types of noise, stays more robust overall than the view-based method (Figure 9). We argue that its advantage results from operating on fused data. The view-based method, on the other hand, is less affected by pose noise, due to its in-view labelling and the benefits of Bayesian label fusion. However, it strongly degrades in the presence of depth noise. We argue that this is because it has to operate directly on the noisy depth data, resulting in not only projection errors, but also 2D segmentation errors.
Overall, the present results should be carefully considered within the context of our experiments. Firstly, the 2.5D geometry of the height map relieves the map-based method of the memory limitations of 3D segmentation tasks. Secondly, the appearance features of our selection of scattered objects with little overhang are clearly visible from a top-down perspective, while for other objects (e.g. cups, bowls, chairs) with more ambiguous features it would be more difficult to train a top-down network. In settings with less accurate networks, one-off labelling of the map segments would likely yield more errors, which the view-based method could better recover from owing to multi-view Bayesian fusion of labels. We do not cover networks of differing accuracy in this study and leave this for future work.
We conclude that in the absence of noisy data, the map-based approach shows higher labelling accuracy and object-border details. In the presence of pose noise, which only affects the reconstruction, the view-based method shows a significant advantage, but it deteriorates more strongly in the presence of depth noise. In terms of computational cost, the map-based approach is in principle more efficient, given that every map element is processed only once, compared to the frame-wise segmentation and fusion required by the view-based method. In a three-dimensional setting, this advantage will likely be reduced due to expensive 3D data processing such as volumetric labelling, but we leave this analysis for future work, as it requires the comparison between different representations of 3D data and their different methods of 3D labelling. We see a further point of continuation in combining both view-based and map-based methods, leveraging their respective advantages in the appropriate setting. For instance, a real-time SLAM system would benefit from incremental fusion in the initial mapping phase and a regular map-based label refinement step in well-reconstructed regions of the map.