Active Scene Understanding via Online Semantic Reconstruction

06/18/2019 ∙ by Lintao Zheng, et al. ∙ 6

We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeting at the recognition and segmentation of semantic objects from the scene. Our algorithm is built on top of the volumetric depth fusion framework (e.g., KinectFusion) and performs real-time voxel-based semantic labeling over the online reconstructed volume. The robot is guided by an online estimated discrete viewing score field (VSF) parameterized over the 3D space of 2D location and azimuth rotation. VSF stores for each grid the score of the corresponding view, which measures how much it reduces the uncertainty (entropy) of both geometric reconstruction and semantic labeling. Based on VSF, we select the next best views (NBV) as the target for each time step. We then jointly optimize the traverse path and camera trajectory between two adjacent NBVs, through maximizing the integral viewing score (information gain) along path and trajectory. Through extensive evaluation, we show that our method achieves efficient and accurate online scene parsing during exploratory scanning.




1 Introduction

With the proliferation of commodity RGBD sensors and the rapid progress of 3D deep learning techniques, 3D scene understanding based on RGBD data has emerged as a core problem of 3D vision and has recently gained much attention from both the graphics and vision communities [SLX15, GAGM15, NKP19]. The majority of existing works pursue offline, passive analysis, in which scene understanding, encompassing object detection and/or segmentation, is conducted over already acquired RGBD sequences or their 3D reconstruction. In such an approach, data acquisition is analysis-agnostic. Therefore, offline analysis often suffers from incomplete and uninformative data acquisition, which greatly limits the performance of scene understanding.

Online scene understanding is a different paradigm in which acquisition and analysis are intertwined [XHS15, LXS18, YLL18]: scene analysis is conducted online based on the progressively acquired scene data, while scene scanning is in turn driven by the requirement of efficient scene understanding. Such a coupled solution is well suited to robot-operated autonomous scene understanding: the robot actively selects scanning views and traversing paths to cover the regions that best facilitate scene parsing, with minimum traversing and scanning effort. Scanning is therefore both driven by and targeted at understanding.

Online scene understanding can be performed either directly over the online acquired RGBD sequence or based on online RGBD reconstruction. Most recent works adopt the former due to the deep-learning-friendly representation of RGBD images [GAGM15]. However, 3D object segmentation is best performed over the 3D reconstruction of scene geometry, which facilitates 3D spatial and structural reasoning [ZXTZ14, XHS15]. Modern real-time RGBD reconstruction usually adopts the volumetric depth fusion approach [NDI11, IKH11, NZIS13, WLSM15, DNZ17], where the depth images acquired in real time are registered and fused into a volumetric representation of scene geometry, i.e., a Truncated Signed Distance Field (TSDF) [CL96]. Volumetric representation is well suited for 3D feature learning based on deep neural networks.

Inspired by the recent work on semantic scene segmentation with volumetric representation [HDN18], we propose a method of active scene understanding based on online RGBD reconstruction with volumetric segmentation. Based on the online reconstructed TSDF volume, our method leverages a deep neural network to perform real-time voxel-based semantic labeling. The network contains a 2D feature extraction module for extracting 2D features from multi-view RGB images, as well as an incremental 3D feature aggregation module specifically designed for real-time inference. The 3D feature fusion and spatial reasoning based on the online updated volume lead to reliable online semantic segmentation.

The robot scanning is guided by an online estimated discrete viewing score field (VSF) parameterized in the 3D view space of 2D location and azimuth rotation. The VSF stores for each view a score measuring how much it reduces the uncertainty (entropy) of both geometric reconstruction and semantic labeling. Based on the VSF, we select the next best view (NBV) as the target for each time step. We then jointly optimize the traverse path and camera trajectory between two adjacent NBVs by maximizing the integral viewing score (information gain) along the path and trajectory in the view space. We have conducted extensive experimental evaluations and comparisons, and show that our method achieves fast, accurate and complete scene parsing, outperforming the state of the art.

To sum up, the contributions of this work are:

  • A new approach to active scene understanding based on online semantic reconstruction.

  • An efficient semantic segmentation network with incremental volumetric feature aggregation.

  • A method for estimating the next best view based on the uncertainty in scene reconstruction and understanding.

  • A method for joint optimization of robot path and camera trajectory in three-dimensional view space.

2 Related Works

Figure 1: An overview of our method. Given the current reconstruction and understanding in (a), the robot performs online progressive reconstruction and entropy map computation/updating (b). Based on that, the view scoring field (VSF) is generated (c). Based on VSF, it performs field-guided optimization of robot path and camera trajectory (d). (e) shows the online reconstruction with semantic segmentation and (f) visualizes the updated entropy map for the next iteration.

Scene understanding

Scene understanding has been a long-standing problem in both vision and graphics. The two main problems of scene understanding are scene classification and semantic parsing (object detection and/or segmentation). With the development of commodity depth sensors, the input of interest has been shifting from 2D RGB images [LSFFX10], 3D CAD models [FSH11, XMZ14] or 3D point clouds [NXS12], to RGBD images [GAM13, SX16] and/or their 3D reconstruction [KKS13, HDN18, MHDL17]. To take advantage of deep learning, much attention has been paid to designing suitable representations and efficient neural networks for the task of RGBD-based understanding [SHB12, CDF17, SYZ17, QLJ17]. Most existing works have hitherto been devoted to offline, passive understanding based on already acquired scene data. Surprisingly few works study how to actively acquire the scene data that are most useful for scene understanding, despite the availability of real-time RGBD acquisition and reconstruction nowadays.

Online RGB-D reconstruction

With the introduction of commodity depth cameras, we have seen significant advances in online RGB-D reconstruction. KinectFusion [NDI11, IKH11] was one of the first systems to realize the volumetric fusion framework of [CL96] in real time. In order to handle larger environments, spatial hierarchies [CBI13] and hashing schemes [NZIS13, KPR15] have been proposed. At scale, these methods also require robust, global pose optimization, which is common in offline approaches [CZK15]; however, fast GPU optimization techniques [DNZ17] or online re-localization methods [WLSM15] allow for real-time global pose alignment. Our work builds upon this line of research to achieve active RGBD-based scene understanding.

Active object recognition

Autonomous object detection and/or recognition is one of the most important abilities of domestic robots. A common solution to active object recognition is to actively resolve the ambiguities of a certain viewpoint in recognizing an object. In cases where the target object is known, Browatzki et al. [BTM12] define characteristic views, on a view sphere around the object, which are most beneficial in discriminating similar objects. Potthast et al. [PBSS16] introduce an information-theoretic framework combining two common techniques: online feature selection for reducing computational costs and view planning for resolving ambiguities and occlusions. A similar idea was also utilized in [XSZ16] for active, fine-grained object recognition. Song et al. [SZX15] propose an information-theoretic approach based on 3D volumetric deep learning [WSK15]. When target objects are unknown, detection and recognition need to be solved simultaneously. Ye et al. [YLL18] propose navigation policy learning guided by active object detection and recognition. The work in [LXS18] is the most similar in spirit to ours. They develop a data-driven solution to autonomous object detection and recognition with one navigation pass in an indoor room. The problem is formulated as online scene segmentation with database 3D models serving as templates. Our work frames the problem as online volumetric reconstruction and deep-learning-based voxel labeling.

Active scene segmentation

Semantic segmentation of an indoor scene is critical to accurate robot-environment interaction. However, many existing approaches do not involve online active view selection. Mishra et al. [MAF09] propose fixation-based active scene segmentation in which the agent segments only one image region at a time, specifically the one containing the fixation point of an active observer. A similar method is studied in [BK10], which integrates different cues in a temporal framework to improve object hypotheses over time. Xu et al. [XHS15] present an autoscanning system for indoor scene reconstruction with object-level segmentation. They adopt a proactive approach where objects are detected and segmented with the help of physical interaction (poking). In our system, scene segmentation is achieved by actively selecting the best viewpoints and traverse paths that maximally determine the volumetric labeling.

3 Method

3.1 Problem Statement and Overview

Problem statement

Given an indoor scene whose map is unknown, the objective of our system is to drive a ground robot mounted with an RGBD camera to explore and actively parse the scene into semantic objects. It is impossible to plan the complete scan path in advance since the map of the target scene is unavailable at the beginning, which makes it a chicken-and-egg problem. We therefore have to solve for scene understanding and path planning simultaneously. Existing approaches to active scene scanning usually take a “scan and plan” paradigm, which considers only geometric but not semantic information when planning the robot scanning. In this work, we frame the problem as online reconstruction with semantic segmentation and propose a novel “scan, understand, and plan” solution.

Method overview

For the purpose of online scene understanding, we introduce a semantic segmentation network based on online volumetric reconstruction, inspired by [HDN18]. The basic idea of our network is to first extract multi-view 2D features and then perform feature aggregation based on 3D convolution over the online reconstructed TSDF volume. Different from the offline scene understanding in [HDN18], the input for semantic labeling is dynamic due to the progressive scanning and online reconstruction. Therefore, the feature aggregation must follow the online reconstructed TSDF volume. Furthermore, to avoid redundant computation, our network bypasses the known and unchanged voxels in the TSDF volume during feature aggregation, thus significantly improving online efficiency.

To guide the robot in achieving a fast online semantic reconstruction with minimal scanning effort, we adopt an information-theoretic approach to Next-Best-View (NBV) prediction by minimizing the uncertainty (entropy) of semantic reconstruction. The entropy measures the uncertainty of both geometric reconstruction and semantic segmentation. In particular, we present a field-guided optimization of robot path and camera trajectory to maximize the information gain in traversing and scanning between every two adjacent NBVs.

An overview of the process is given in Algorithm 1. Any scanning move of the robot collects some semantic information about the unknown scene. An entropy-based (Section 3.3) view scoring field is generated from the online reconstructed TSDF with semantic labels (Section 3.2). To maximize scanning efficiency, the Next-Best-View (NBV) should enable the robot to reduce the overall entropy as much as possible in the next move. Based on the online updated entropy map and occupancy grid, we compute a view scoring field, based on which the robot path and camera orientation can be optimized jointly (Section 3.4). The above process repeats until the termination condition is met.

Algorithm 1: Robot scanning guided by online reconstruction and semantic segmentation.
Input: initial TSDF and occupancy grid built from a few random scans, and the robot location
Output: semantic labels and the optimized scanning path
Initialize the scanning path;
Initialize the entropy map from the TSDF and the occupancy grid;
Initialize the view scoring field;
repeat
        // Path planning and camera rotation optimization based on the view scoring field
        Find the NBV;
        Find the optimal robot and camera path from the current view to the NBV;
        // Update scene mapping based on the given path
        Scan along the path and update the semantic map;
        Update the reconstruction;
        Update the entropy map from the updated reconstruction and semantic map;
        Update the view scoring field from the entropy map;
        // Record current path planning
        Append the planned path to the scanning path;
until the termination condition is met;
return the semantic labels and the scanning path;

3.2 Online Reconstruction with Semantic Segmentation

We measure the quality of a scan view by how much the uncertainty of scene understanding would be reduced through this move. In our work, the uncertainty of scene understanding is measured from two aspects, i.e., geometry reconstruction and semantic segmentation.

Figure 2: The architecture of our online semantic segmentation network. The key difference between our network and 3D-SIS [HDN18] is that our feature aggregation is incremental. The input of our network contains two components, local and global (red box). We save substantial processing time by avoiding redundant operations on overlapped areas. More specifically, voxels that have been observed in previous steps do not go through the 3D convolution layers and 3D-RoI (blue box) again; our network reuses the stored information directly.

RGBD-based reconstruction with volumetric representation

Given a sequence of RGBD images, we adopt the volumetric representation (TSDF) for depth fusion [CL96]. The construction of the TSDF is incremental: the occupancy uncertainty of each voxel is reduced as more images are fused into the volume. Usually, the occupancy of a voxel can be modeled with a 1D half-normal distribution (Equation (1)), whose variance provides a measure of reconstruction uncertainty. More specifically, the variance is defined based on how many images provide positive support for the occupancy of the voxel [HWB13]. Positive support here means that the camera receives a reflected signal from the voxel when capturing a depth image; negative support means that it does not. For simplicity, every positive support contributes a fixed amount and every negative support contributes a fixed penalty.
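The incremental nature of TSDF construction can be illustrated with the weighted running average commonly used in volumetric depth fusion. The following is only a minimal sketch under stated assumptions: the function name `fuse_tsdf`, the unit per-observation weight, and the `trunc` and `max_weight` values are hypothetical, and the half-normal uncertainty model described above is not reproduced.

```python
import numpy as np

def fuse_tsdf(tsdf, weight, new_sdf, observed, trunc=0.1, max_weight=64.0):
    """Fuse one depth observation into a TSDF volume (weighted running average).

    tsdf, weight : arrays of the same shape holding the current volume state
    new_sdf      : signed distances measured by the new frame, same shape
    observed     : boolean mask of the voxels seen by the new frame
    """
    d = np.clip(new_sdf, -trunc, trunc)       # truncate the signed distance
    w_new = observed.astype(tsdf.dtype)       # assumed unit weight per observation
    total = weight + w_new
    averaged = (weight * tsdf + w_new * d) / np.maximum(total, 1e-12)
    fused = np.where(total > 0, averaged, tsdf)   # never-seen voxels keep their value
    return fused, np.minimum(total, max_weight)

# Two frames observe voxels 0 and 1; voxel 2 is never observed.
tsdf = np.zeros(3)
weight = np.zeros(3)
seen = np.array([True, True, False])
tsdf, weight = fuse_tsdf(tsdf, weight, np.array([0.05, 0.30, -0.02]), seen)
tsdf, weight = fuse_tsdf(tsdf, weight, np.array([0.03, 0.10, -0.02]), seen)
print(tsdf, weight)
```

In this sketch the accumulated weight simply counts how often a voxel was observed, which is one possible proxy for the amount of support a voxel has received.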


Semantic reconstruction network

To incrementally gain semantic information during scanning, we propose a network that predicts a 3D semantic segmentation based on the TSDF. More specifically, we want to infer the semantic labeling over the TSDF on a per-voxel basis. The backbone of our network is similar to [HDN18]. We first briefly review the network architecture and then discuss our improvement over it.

The network is composed of two main components: object detection and per-voxel label prediction. Each component has its own feature extraction module, composed of 2D and 3D feature extraction layers. The extracted 2D and 3D features are aggregated by a series of 3D convolutional layers over the TSDF volume. The object detection component comprises a 3D region proposal network (3D-RPN) to predict bounding box locations, followed by a 3D region-of-interest (3D-RoI) pooling layer for classification. The per-voxel mask prediction network takes the geometry as well as the predicted bounding box locations as input. The cropped feature channels are used to create a mask prediction for per-voxel semantic labeling together with a confidence score.

However, this network is designed for offline scene understanding where the reconstruction is already given. In our problem setting, the reconstruction is executed online, with smooth and progressive RGBD acquisition. This means that there is immense overlap between the observations of every two adjacent RGBD frames. Directly applying feature aggregation would result in much computational redundancy. To support real-time application, we modify this network to reuse the previous feature aggregation as much as possible.

Incremental 3D feature aggregation

Most offline scene understanding methods do not consider how to process dynamic inputs. We present an incremental semantic segmentation network specifically designed for online understanding. The key insight of our approach is that 3D convolution should be performed only on the newly observed voxels, reusing the previous results for overlapping areas as much as possible. Figure 3 gives an illustration of this.

Figure 3: An illustration of incremental semantic segmentation.

More specifically, we maintain a global data structure to record the TSDF and 3D feature information for all the observed voxels. When the network gets a new local input, the first step is to find the overlapped areas between the input and the global record. Our network then skips the 3D convolution for these overlapped areas and reuses the stored information directly, which saves considerable computation time.

Moreover, our network also reuses the results of the 3D-RPN. Box proposals in the overlapped areas are not processed again for a new local input. By removing these redundant proposals, our incremental network improves efficiency one step further. In our experiments, the incremental process makes our network 23.6% faster if the input has 50% overlapped area and 41.1% faster if the input has 75% overlapped area. More details about our network can be found in Figure 2.
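The reuse of stored features for overlapped voxels can be sketched as a simple cache keyed by voxel index. This is a deliberately simplified illustration, not the network itself: the class `IncrementalFeatureCache` and the function `toy_features` are hypothetical, the per-voxel function stands in for the 3D convolution stack, and the sketch ignores that a real convolution has a receptive field, so voxels bordering newly observed regions may need to be recomputed.

```python
import numpy as np

class IncrementalFeatureCache:
    """Global record of per-voxel features; only unseen voxels are processed."""

    def __init__(self, feature_fn):
        self.feature_fn = feature_fn   # stand-in for the 3D convolution stack
        self.store = {}                # voxel index (tuple) -> feature vector
        self.computed = 0              # number of voxels actually processed

    def aggregate(self, voxel_ids, tsdf_values):
        # Step 1: find the overlap between the local input and the global record.
        new_ids = [v for v in voxel_ids if v not in self.store]
        # Step 2: run the expensive computation on newly observed voxels only.
        if new_ids:
            new_vals = np.array([tsdf_values[v] for v in new_ids])
            for v, f in zip(new_ids, self.feature_fn(new_vals)):
                self.store[v] = f
            self.computed += len(new_ids)
        # Step 3: assemble the output, reusing stored features for the overlap.
        return np.stack([self.store[v] for v in voxel_ids])

def toy_features(values):
    # Hypothetical per-voxel feature: the TSDF value and its square.
    return np.stack([values, values ** 2], axis=1)

cache = IncrementalFeatureCache(toy_features)
tsdf = {(0, 0, 0): 0.1, (0, 0, 1): -0.2, (0, 0, 2): 0.3}
first = cache.aggregate([(0, 0, 0), (0, 0, 1)], tsdf)
second = cache.aggregate([(0, 0, 1), (0, 0, 2)], tsdf)   # 50% overlap with the first input
print(cache.computed)   # 3 voxels processed instead of 4
```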

Figure 4: The evolution of scanning entropy over increasing scans.

3.3 Reconstruction and Segmentation Entropy

We adopt Shannon entropy to measure the information gain of robot scanning. In particular, we estimate the average new information the robot can collect under a specific pose. In other words, we want to measure how much uncertainty would be reduced by a potential scanning view.

The entropy map is defined on each voxel in the 3D scene. Different from previous methods such as [BWCE16], we count not only the geometric occupancy probability of each voxel but also its predicted semantic label as new information. We follow the general (Shannon) definition of entropy and measure the gained information as the resulting reduction in uncertainty. The key to evaluating the quality of new information is how to define the probability in the entropy for geometric and semantic information, respectively. We then sum these two terms in a weighted fashion to obtain the final formulation of the gained information (Equation (2)), where two constants weight the geometry term and the semantic term.
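As a rough illustration of a weighted per-voxel entropy, the sketch below applies the binary Shannon entropy to an occupancy probability and to a label confidence and sums the two terms. The binary form, the function names, and the weights `w_geo` and `w_sem` are assumptions made for illustration; the exact formulation and constants used in this work are not reproduced here.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in bits) of a Bernoulli variable with probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def combined_entropy(p_occupancy, p_label, w_geo=0.5, w_sem=0.5):
    """Weighted sum of a geometry term and a semantic term (weights are hypothetical)."""
    return w_geo * binary_entropy(p_occupancy) + w_sem * binary_entropy(p_label)

# Voxel 0: uncertain geometry and label; voxel 1: both confident;
# voxel 2: uncertain geometry but confident label.
p_occ = np.array([0.5, 0.99, 0.5])
p_lab = np.array([0.5, 0.99, 0.99])
h = combined_entropy(p_occ, p_lab)
print(h)
```

With these inputs the fully uncertain voxel receives the highest entropy and the fully confident voxel the lowest, while the third voxel falls in between.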


Geometry reconstruction entropy

As discussed in Section 3.2, the uncertainty of a voxel in geometry reconstruction can be defined as in Equation (1). However, the output range of this uncertainty formulation does not allow it to be used directly as the probability function in an entropy formulation. We therefore simply remap this uncertainty function and use the result in our geometry reconstruction entropy.

Figure 5: Visualization of scanning entropy encompassing both geometric and semantic uncertainty.

Semantic segmentation entropy

To measure the uncertainty of semantic segmentation, we take both the predicted semantic label and the corresponding confidence score of a voxel into consideration. If the semantic label predicted for a voxel in the current scan move is the same as the previously predicted one and the confidence score becomes higher, then the uncertainty of semantic segmentation for that voxel is reduced. We also gain information in a second case, where the confidence score becomes higher even though the predicted semantic labels differ. Therefore, we formulate the semantic segmentation entropy in terms of the score that our semantic reconstruction network assigns to a specific semantic label for the new observation of a voxel.

To make the idea of this combined entropy clearer, we present a visual example in Figure 5. Note that a higher entropy value means lower confidence, since more information is needed to become confident. It is clear in Figure 5 that, if we consider only the geometry term, the robot has no idea what to do next, since the uncertainty of geometry reconstruction is similar everywhere in this case. The semantic term, however, gives good guidance about which area should be focused on in the next move, since there is a valid semantic object (sofa) in the right view.

To show that our combined entropy is positively correlated to the quality of semantic labeling, we provide a side-by-side visual comparison in Figure 4. The objective of our NBV prediction is now clear: the NBV should be the view whose voxels carry the highest uncertainty. Based on Equation (2), NBV prediction is formulated over all the voxels in the current camera view.
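One simple way to realize this selection is to sum the per-voxel entropy over the voxels visible from each candidate view and pick the maximum. The sketch below assumes a precomputed boolean visibility matrix and uses summation as the aggregation; both the function name `select_nbv` and the choice of summation are illustrative assumptions rather than the exact formulation.

```python
import numpy as np

def select_nbv(entropy, visibility):
    """Pick the candidate view that covers the most total entropy.

    entropy    : (V,) per-voxel entropy values
    visibility : (N, V) boolean matrix, True where view i sees voxel j
    """
    scores = visibility.astype(float) @ entropy   # total entropy visible per view
    return int(np.argmax(scores)), scores

entropy = np.array([0.9, 0.1, 0.8, 0.2])
visibility = np.array([
    [False, True,  False, True],    # view 0 sees two low-entropy voxels
    [True,  False, True,  False],   # view 1 sees two high-entropy voxels
    [True,  True,  False, False],   # view 2 sees one of each
])
best, scores = select_nbv(entropy, visibility)
print(best, scores)
```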


3.4 Field-guided Scan Planning

Figure 6: (a): Illustration of 3D parametric space of location and orientation. (b,c): Visualization of view scoring field and the optimal path found by algorithm between two camera poses in the field. (d): The computed robot trajectory based on field-guided optimization.

View scoring field

Our scan planning is composed of two components: NBV prediction and path planning with camera optimization. Both components are implemented upon a 3D field which records the entropy information described in Section 3.3. Note that this field is constructed incrementally during the scanning process.

However, the gained information is not the only factor that should be considered in constructing the field. The following factors are also important components of our view scoring field:

  • Safety: View point must be in free space and keep a safe distance away from obstacles;

  • Visibility: Views should orient toward objects or frontiers to maximize information gain;

  • Movement cost: Robot traverse path should be as short as possible.

For this reason, Equation (7) alone is not sufficient to find the most appropriate NBV for our system. Besides the gained information, we also use the occupancy grid when computing the values of our view scoring field, which helps to measure the above three factors.

To ensure robot safety, we obtain obstacle information from the 2D projection of the occupancy grid, and only sample views that keep a safe distance from obstacles.

The next factor we consider is visibility to frontiers. A frontier is the boundary between (known) empty regions and unknown ones, and it is a well-known driving factor in robot exploration. We measure the visibility to frontiers by counting the frontier voxels visible in the current view frustum, that is, among all the voxels in the frustum of the given view from a voxel. We also need to make sure that the planned path is not too long. However, we cannot obtain the exact path length before the final path planning.

Here we use an approximate distance estimate, measured from the current robot location, to impose this movement constraint. After formulating all these factors, we assemble them into the 3D view scoring field. For each ground grid voxel of the given scene, we sample several different views, and the safety, visibility and movement factors are calculated for each view of every ground voxel. This yields the final view score for each grid voxel under each camera view; a 3D visualization of this field is shown in Figure 6(a).
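The frontier and movement factors can be sketched on a 2D occupancy grid as follows. This is a simplified illustration under several assumptions: the view frustum is reduced to a 2D sector without occlusion testing, the distance is a straight-line estimate, and the additive combination in `view_score` with weights `w_frontier` and `w_dist` is hypothetical rather than the scoring formula of this work.

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def frontier_mask(grid):
    """Free cells with at least one 4-connected unknown neighbour."""
    unknown = np.pad(grid == UNKNOWN, 1, constant_values=False)
    near_unknown = (unknown[:-2, 1:-1] | unknown[2:, 1:-1] |
                    unknown[1:-1, :-2] | unknown[1:-1, 2:])
    return (grid == FREE) & near_unknown

def visible_frontiers(frontier, cell, azimuth, fov=np.deg2rad(60), max_range=5.0):
    """Count frontier cells inside a 2D view sector (occlusion is ignored).

    Azimuth 0 points along increasing column index; angles use atan2(d_row, d_col).
    """
    rows, cols = np.nonzero(frontier)
    dr, dc = rows - cell[0], cols - cell[1]
    dist = np.hypot(dr, dc)
    bearing = np.arctan2(dr, dc)
    diff = np.abs((bearing - azimuth + np.pi) % (2 * np.pi) - np.pi)
    return int(np.sum((dist > 0) & (dist <= max_range) & (diff <= fov / 2)))

def view_score(info_gain, n_frontier, cell, robot, w_frontier=1.0, w_dist=0.1):
    """Hypothetical additive score: reward gain and frontiers, penalize distance."""
    dist = np.hypot(cell[0] - robot[0], cell[1] - robot[1])   # straight-line estimate
    return info_gain + w_frontier * n_frontier - w_dist * dist

grid = np.array([[0, 0, -1, -1],
                 [0, 0, -1, -1],
                 [0, 0,  1, -1]])
frontier = frontier_mask(grid)
facing = visible_frontiers(frontier, (1, 0), azimuth=0.0)
away = visible_frontiers(frontier, (1, 0), azimuth=np.pi)
print(frontier.sum(), facing, away)
```

A view facing the unknown region sees a frontier cell, while the opposite orientation at the same location sees none, which is why orientation is part of the view space.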


Optimization formulation

We update this view scoring field after each scan move, and the NBV can be computed simply by maximizing the view score over the field. However, the main challenge lies in computing a collision-free path from the current robot position to the NBV, so that the path maximizes the information gain of semantic reconstruction and minimizes the traverse distance.

To guarantee robot safety and scanning efficiency, the view scoring field plays a significant role in the path planning algorithm. Formally, we define the total cost of the optimal path:


Here, the optimization ranges over the set of all possible paths between the two locations, takes into account the maximum rotation speed of the robot camera, and uses a large constant, which we set to 500 in our experiments, to incorporate the view scoring field into our costmap. However, even though we consider the safety factor when designing the view scoring field, this still cannot guarantee that the path found through Equation (10) is collision-free. We therefore introduce a 2D obstacle costmap to address this problem.

The 2D obstacle map is obtained by projecting the 3D occupancy grid map. We generate the obstacle cost map using a two-dimensional Gaussian distribution.
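A possible way to build such a cost map is to project the 3D occupancy grid along the height axis and place an isotropic 2D Gaussian around every obstacle cell. In the sketch below, the function name `obstacle_costmap`, the value of `sigma`, and the use of the maximum over per-obstacle Gaussians are assumptions made for illustration.

```python
import numpy as np

def obstacle_costmap(occupancy_3d, sigma=1.5, height_axis=2):
    """Project a 3D occupancy grid to 2D and spread a Gaussian cost around obstacles."""
    obstacles = occupancy_3d.any(axis=height_axis)     # 2D projection of occupied voxels
    rows, cols = np.indices(obstacles.shape)
    cost = np.zeros(obstacles.shape, dtype=float)
    for r, c in zip(*np.nonzero(obstacles)):
        d2 = (rows - r) ** 2 + (cols - c) ** 2         # squared distance to this obstacle
        cost = np.maximum(cost, np.exp(-d2 / (2.0 * sigma ** 2)))
    return cost

occ = np.zeros((5, 5, 3), dtype=bool)
occ[2, 2, 1] = True                 # a single occupied voxel in the middle
cost = obstacle_costmap(occ)
print(np.round(cost[2], 3))         # cost decays with distance from the obstacle
```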


We integrate the 2D obstacle map into the view scoring field and change the optimization formulation from Equation (10) to the following:


Scan planning by optimization

To solve the path and camera optimization defined in Equation (12), we adopt a search algorithm to find the optimal solution at the discrete level. Figure 6 illustrates how we obtain the camera view path and the robot path from the optimal path given by the 3D costmap. We project the optimal path found by the algorithm onto the rotation axis to obtain the camera rotation sequence, and its projection onto the ground plane is the optimal 2D robot path.
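The idea of searching in the 3D view space and then projecting the result can be sketched with a generic shortest-path search. The code below uses Dijkstra's algorithm purely as a stand-in, since the specific search algorithm is not named in the text here; the step cost `score.max() - score + 1`, the 4-connected moves, and the single-bin rotations are likewise assumptions, not the cost of Equation (12).

```python
import heapq
import numpy as np

def plan_in_view_space(score, blocked, start, goal):
    """Dijkstra over (row, col, azimuth bin); steps into high-scoring views cost less."""
    n_rows, n_cols, n_rot = score.shape
    step_cost = score.max() - score + 1.0          # always >= 1
    dist, parent, heap = {start: 0.0}, {}, [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue                                # stale heap entry
        r, c, k = node
        neighbours = [(r + 1, c, k), (r - 1, c, k), (r, c + 1, k), (r, c - 1, k),
                      (r, c, (k + 1) % n_rot), (r, c, (k - 1) % n_rot)]
        for nr, nc, nk in neighbours:
            if not (0 <= nr < n_rows and 0 <= nc < n_cols) or blocked[nr, nc]:
                continue                            # outside the map or an obstacle
            nd = d + step_cost[nr, nc, nk]
            if nd < dist.get((nr, nc, nk), np.inf):
                dist[(nr, nc, nk)] = nd
                parent[(nr, nc, nk)] = node
                heapq.heappush(heap, (nd, (nr, nc, nk)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]

def project_path(path):
    """Split a view-space path into a 2D robot path and a camera rotation sequence."""
    return [p[:2] for p in path], [p[2] for p in path]

score = np.ones((3, 3, 4))                  # uniform toy view scoring field
blocked = np.zeros((3, 3), dtype=bool)
blocked[1, 1] = True                        # obstacle in the centre cell
path = plan_in_view_space(score, blocked, (0, 0, 0), (2, 2, 1))
robot_path, camera_bins = project_path(path)
print(robot_path, camera_bins)
```

In the projected robot path, a cell repeats whenever the step was a pure camera rotation, so the two sequences stay aligned step by step.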

This “scan, analyze and plan” process is repeated until the termination condition is met, leading to a progressive understanding of the scene by the robot. In our experiments, the robot stops the exploration once the overall entropy is reduced below a certain threshold, which means that no significant uncertainty remains in our scene understanding.

4 Results and Evaluation

There are three primary questions that we seek to answer with our experiments and evaluations.

  • How does our approach compare to previous work in terms of distance traveled, time cost, and semantic quality?

  • How much effect does the semantic entropy term have on the results?

  • How well does field guided path planning improve the scanning efficiency?

4.1 System and implementation

Simulation setup

The simulation is conducted by using the Gazebo simulator [KH04]. We adopt a differential drive ground robot equipped with a virtual RGB-D camera simulating the Kinect v1 sensor. We assume the sensor has a depth range of m with m noise. The camera is mounted on top of the robot and has one DoF of azimuth rotation. To make the simulation more realistic, the ground robot will obtain a noisy pose estimation from the simulator. The simulation runs on a computer with Intel I7-5930K CPU (3.5GHZ *12), 32GB RAM, and an NVIDIA GeForce GTX 1080 Graphics card.


Our benchmark dataset is built upon the virtual scene dataset SUNCG. SUNCG contains 40K human-modeled 3D indoor scenes with visually realistic geometry and texture. It encompasses indoor rooms ranging from single-room studios to multi-floor houses. We select 180 scenes which are suitable for navigation and exploration task. These scenes have averagely rooms and the average total area is m. Different interiors including offices, bedrooms, sitting rooms, kitchens, etc. are involved in our dataset to guarantee the test variety. The dataset also provides ground truth object segmentation and labeling for the scenes.

Parameters and details

The 3D occupancy grid is constructed with a resolution of m. The resolution of viewing score field is m. In our experiments, the upper limit of linear and angular speed of the robot is m/s and per second, respectively. The coefficient ratio is set to and is set to .

4.2 Comparison and evaluation

In this section, we conduct a series of experiments and comparisons that evaluate the scanning efficiency and semantic mapping quality of our method. Since it is impossible to get the input scene fully labeled voxel-wise, we evaluate scanning efficiency by measuring the time our system takes to reach a given number of correctly labeled voxels. To evaluate semantic quality, we measure the accuracy of the final scene segmentation.

Comparison with alternative NBV methods

Our method is compared to several state-of-the-art NBV techniques: Bayesian optimization-based exploration method (BO) [BWCE16], information-theoretic planning approach (IG) [CKP15] and Object-Aware based scene reconstruction algorithm (NBO) [LXS18]. For all these methods, a fixed forward-looking virtual camera is used.

Scanning efficiency

We compare the scanning time and travel distance of the four approaches while scanning the scenes virtually. The initial positions and orientations of the robot are the same for all methods. Scanning time and traveled distance over the number of correctly labeled voxels are plotted in Figure 7. We observe that scanning time and traveled distance increase as the semantic mapping of the scene becomes more complete (more and more occupied voxels get labeled), but the proposed approach consistently requires the least time and the shortest distance.

Figure 7: Comparing scanning efficiency between our method (red), NBO (green), BO (magenta) and IG (blue), over different scenes. It is measured by traveled distance and time over the number of correctly labeled voxels.

Semantic segmentation performance

To evaluate the quality of semantic segmentation, we measure the segmentation accuracy and the number of identified objects (excluding wall, ceiling and floor). Both quantities are plotted over traveled distance in Figure 8. The number of correctly labeled voxels increases as the robot explores more area, and the accuracy increases as well. The results show that our method attains the highest semantic accuracy and the largest number of identified objects almost all the way.

Figure 8: Comparing segmentation accuracy (left) and object recognition (right) against NBO, BO, and IG. Note that the total numbers of semantic voxels and objects can be obtained from the ground truth.
Figure 9: Effect of various entropy items on semantic segmentation performance (left) and exploration efficiency (right). The combined entropy leads to faster scene exploration and more complete segmentation results.

To further demonstrate the advantage of our algorithm, Figure 11 shows some visual results of the final semantic segmentation. The results show that our scanning strategy leads to more complete and more accurate segmentation. For more visual results, please refer to the supplemental material.

Figure 10: Effect of field-guided path planning on scanning efficiency. The proposed algorithm (red) is compared against the classical Dijkstra method (blue). Using field-guided path planning, the robot travels less distance and takes less time.
Figure 11: Qualitative comparison of indoor scene semantic segmentation on SUNCG dataset. Note that different colors represent different semantic labels.

Ablation study on semantic entropy

Occupancy entropy tends to guide the robot to explore more unknown space, while semantic entropy is more likely to guide the robot to exploit the already scanned region. In this experiment, we investigate the effect of the semantic entropy term on the efficiency and quality of semantic segmentation. Figure 9 shows the numbers of correctly labeled voxels and of all observed voxels over the distance traveled by the robot, with only occupancy entropy, with only semantic entropy, and with the combined entropy.

As shown in the plot, at an early stage, when little of the scene has been observed, the benefit of semantic entropy is significant, owing to better exploitation of the partially scanned scene. As the robot travels farther, the occupancy entropy starts to take effect, leading to faster discovery of unknown space and more observed voxels. Since semantic entropy alone cannot guarantee the discovery of new regions, the robot sometimes gets stuck in already scanned regions when only semantic entropy is used. With occupancy entropy only, the robot finds new regions faster, so the total number of scanned voxels is always the highest. However, when little unknown space remains at a later stage, occupancy entropy gives the robot no cue about where to find better observations, which leads to poor semantic segmentation. The combined entropy achieves the best final scanning results and works well for both exploration and exploitation. In Figure 13, we compare the semantic segmentation quality of the three entropy settings. The visual results verify the above analysis.
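
To make the three settings concrete, the sketch below computes a per-voxel occupancy entropy (binary Shannon entropy of the occupancy probability) and a semantic entropy (Shannon entropy of the class distribution). The additive combination and the weight `w_sem` are assumptions made for illustration; the function names are hypothetical as well.

```python
import numpy as np

def occupancy_entropy(p_occ, eps=1e-12):
    """Binary Shannon entropy of the occupancy probability of each voxel."""
    p = np.clip(np.asarray(p_occ, dtype=float), eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def semantic_entropy(class_probs, eps=1e-12):
    """Shannon entropy of the per-voxel class distribution (last axis)."""
    q = np.clip(np.asarray(class_probs, dtype=float), eps, 1.0)
    return -np.sum(q * np.log(q), axis=-1)

def combined_entropy(p_occ, class_probs, w_sem=1.0):
    """Assumed combination: a weighted sum of the two entropy terms."""
    return occupancy_entropy(p_occ) + w_sem * semantic_entropy(class_probs)

# An unobserved voxel (maximal uncertainty in both terms) versus a voxel
# that is confidently occupied and confidently labeled.
h_unknown = combined_entropy(0.5, [0.25, 0.25, 0.25, 0.25])
h_known = combined_entropy(0.99, [0.97, 0.01, 0.01, 0.01])
```

Setting `w_sem=0` recovers the occupancy-only setting, while dropping the first term gives the semantic-only setting. A voxel that is certainly occupied but ambiguously labeled still carries entropy under the combined measure, which is consistent with the exploitation behavior discussed above.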

Effect of viewing score field

To verify the efficiency of our viewing score field-guided path planning, we conducted experiments in four synthetic scenes, comparing our method with the classical Dijkstra path planning algorithm. Table 1 reports the total scanning time and traveling distance of the field-guided and Dijkstra path planning on these scenes. The termination conditions are the same for both algorithms. On average, field-guided planning reduces the scanning time from 1353 s to 1121 s and the traveled distance from 176.5 m to 150.0 m. To further illustrate the benefit of our field-guided approach, Figure 10 plots the time cost and traveled distance over the number of correctly labeled voxels. The field-guided path planning requires less scanning time and a shorter traveled distance all the way.
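
For reference, the Dijkstra baseline can be sketched as a shortest-path search on a 4-connected occupancy grid. The second call illustrates one hypothetical way a viewing score could bias the path, by making high-score cells cheaper to enter. This is only a toy illustration of the idea of score-aware planning; it is not the joint path and camera trajectory optimization of our method, and the grid, the `score` values and the cost function are assumptions.

```python
import heapq

def dijkstra_grid(free, start, goal, step_cost=None):
    """Dijkstra shortest path on a 4-connected grid.

    free[r][c] is True for traversable cells. step_cost(cell) returns the
    positive cost of entering a cell (defaults to 1 for every cell).
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(free), len(free[0])
    cost = step_cost or (lambda cell: 1.0)
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and free[nr][nc]:
                nd = d + cost(nxt)
                if nd < dist.get(nxt, float("inf")):
                    dist[nxt] = nd
                    prev[nxt] = cell
                    heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy 3x3 free-space grid with a high (hypothetical) viewing score in row 1.
free = [[True] * 3 for _ in range(3)]
score = [[0, 0, 0], [9, 9, 9], [0, 0, 0]]

plain_path = dijkstra_grid(free, (0, 0), (0, 2))
guided_path = dijkstra_grid(
    free, (0, 0), (0, 2),
    step_cost=lambda cell: 1.0 / (1.0 + score[cell[0]][cell[1]]),
)
```

With unit costs the path goes straight along the top row, whereas the score-weighted variant detours through the high-score row, trading path length for more informative views.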

Scene    Area (m²)  Field-guided time (s)  Field-guided distance (m)  Dijkstra time (s)  Dijkstra distance (m)
1        134.2      1237                   169.7                      1412               195.1
2        239.1      1480                   195.9                      1671               215.4
3        106.9      708                    92.9                       1041               124.4
4        129.5      1057                   141.1                      1289               171.3
Average  152.3      1121                   150.0                      1353               176.5
Table 1: Comparison between our field-guided method and the Dijkstra method. The field-guided method is more efficient than the Dijkstra method in both time cost and traveled distance.
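
The relative savings implied by Table 1 can be checked with a short calculation. The numbers below are copied from the table; the helper name `saving_percent` is only illustrative.

```python
# (field-guided time s, field-guided distance m, Dijkstra time s, Dijkstra distance m)
table1 = {
    "1": (1237, 169.7, 1412, 195.1),
    "2": (1480, 195.9, 1671, 215.4),
    "3": (708, 92.9, 1041, 124.4),
    "4": (1057, 141.1, 1289, 171.3),
    "Average": (1121, 150.0, 1353, 176.5),
}

def saving_percent(ours, baseline):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

savings = {
    scene: (saving_percent(t_f, t_d), saving_percent(d_f, d_d))
    for scene, (t_f, d_f, t_d, d_d) in table1.items()
}

for scene, (time_saving, dist_saving) in savings.items():
    print(f"Scene {scene}: time -{time_saving:.1f}%, distance -{dist_saving:.1f}%")
```

On the averaged row this gives roughly 17% less scanning time and 15% less traveled distance, with the largest gain on scene 3 (about 32% in time and 25% in distance).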

In addition to the above results, we show more visual results of our active scene understanding in Figure 12. In these examples, the scanning is clearly guided by the collection of semantic information: our method tends to drive the robot to discover most of the semantic objects in a local area before it enters a new area, which improves the efficiency of scene understanding.

Figure 12: Visualization of our active scene understanding process for three different scenes.
Figure 13: Qualitative comparison of semantic segmentation results with different entropy items (Left: global quality; Right: local quality).

5 Conclusions

We have presented a method for active scene understanding based on online RGBD reconstruction with volumetric segmentation. Our method leverages the online reconstructed TSDF volume and learns a deep neural network for voxel-based semantic labeling. It attains the following key features. First, the online scene segmentation is conducted over the online reconstruction, thus benefiting from the 3D spatial reasoning. Second, the robot scanning is guided by the information gain of both geometric reconstruction and semantic understanding. Third, the online estimated viewing score field (VSF) facilitates the joint optimization of moving path and camera orientation.

There are a few promising avenues for future research. First, our NBV prediction is based on the VSF estimated online. A more favorable approach would be to train a network for end-to-end NBV estimation. The difficulty lies in how to account for the uncertainty in both reconstruction and segmentation within one neural network. Second, we would like to explore the use of the proposed framework on a real robot. Third, our VSF-based path/trajectory optimization can be extended to support more flexible scanning settings, for example, a robot holding a depth camera in its arm, similar to [XZY17]. Last, another interesting future direction would be to extend our framework to multi-robot collaborative scene understanding.

References

  • [BK10] Björkman M., Kragic D.: Active 3d scene segmentation and detection of unknown objects. In 2010 IEEE international conference on robotics and automation (2010), IEEE, pp. 3114–3120.
  • [BTM12] Browatzki B., Tikhanoff V., Metta G., Bülthoff H. H., Wallraven C.: Active object recognition on a humanoid robot. In 2012 IEEE International Conference on Robotics and Automation (2012), IEEE, pp. 2021–2028.
  • [BWCE16] Bai S., Wang J., Chen F., Englot B.: Information-theoretic exploration with bayesian optimization. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2016), IEEE, pp. 1816–1822.
  • [CBI13] Chen J., Bautembach D., Izadi S.: Scalable real-time volumetric surface reconstruction. ACM Transactions on Graphics (TOG) 32, 4 (2013), 113.
  • [CDF17] Chang A., Dai A., Funkhouser T., Halber M., Niessner M., Savva M., Song S., Zeng A., Zhang Y.: Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV) (2017).
  • [CKP15] Charrow B., Kahn G., Patil S., Liu S., Goldberg K., Abbeel P., Michael N., Kumar V.: Information-theoretic planning with trajectory optimization for dense 3D mapping. In Proceedings of Robotics: Science and Systems (2015).
  • [CL96] Curless B., Levoy M.: A volumetric method for building complex models from range images. In Proc. of SIGGRAPH (1996), pp. 303–312.
  • [CZK15] Choi S., Zhou Q.-Y., Koltun V.: Robust reconstruction of indoor scenes. In Proc. CVPR (2015), pp. 5556–5565.
  • [DNZ17] Dai A., Nießner M., Zollhöfer M., Izadi S., Theobalt C.: Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG) 36, 3 (2017), 24.
  • [FSH11] Fisher M., Savva M., Hanrahan P.: Characterizing structural relationships in scenes using graph kernels. ACM Trans. on Graph. (SIGGRAPH) (2011).
  • [GAGM15] Gupta S., Arbeláez P., Girshick R., Malik J.: Indoor scene understanding with rgb-d images: Bottom-up segmentation, object detection and semantic segmentation. International Journal of Computer Vision 112, 2 (2015), 133–149.
  • [GAM13] Gupta S., Arbelaez P., Malik J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In Proc. CVPR (2013), pp. 564–571.
  • [HDN18] Hou J., Dai A., Nießner M.: 3d-sis: 3d semantic instance segmentation of rgb-d scans. arXiv preprint arXiv:1812.07003 (2018).
  • [HWB13] Hornung A., Wurm K. M., Bennewitz M., Stachniss C., Burgard W.: OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots 34, 3 (2013), 189–206.
  • [IKH11] Izadi S., Kim D., Hilliges O., Molyneaux D., Newcombe R., Kohli P., Shotton J., Hodges S., Freeman D., Davison A., Fitzgibbon A.: KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In UIST (2011), pp. 559–568.
  • [KH04] Koenig N., Howard A.: Design and use paradigms for gazebo, an open-source multi-robot simulator. In International Conference on Intelligent Robots and Systems (2004), pp. 2149–2154 vol.3.
  • [KKS13] Kim B.-s., Kohli P., Savarese S.: 3d scene understanding by voxel-crf. In Proceedings of the IEEE International Conference on Computer Vision (2013), pp. 1425–1432.
  • [KPR15] Kahler O., Prisacariu V. A., Ren C. Y., Sun X., Torr P. H. S., Murray D. W.: Very high frame rate volumetric integration of depth images on mobile device. IEEE Trans. Vis. & Computer Graphics (ISMAR) 22, 11 (2015).
  • [LSFFX10] Li L.-J., Su H., Fei-Fei L., Xing E. P.: Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in neural information processing systems (2010), pp. 1378–1386.
  • [LXS18] Liu L., Xia X., Sun H., Shen Q., Xu J., Chen B., Huang H., Xu K.: Object-aware guidance for autonomous scene reconstruction. ACM Trans. on Graph. (SIGGRAPH) 37, 4 (2018).
  • [MAF09] Mishra A., Aloimonos Y., Fermuller C.: Active segmentation for robotics. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (2009), IEEE, pp. 3133–3139.
  • [MHDL17] McCormac J., Handa A., Davison A., Leutenegger S.: Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In 2017 IEEE International Conference on Robotics and automation (ICRA) (2017), IEEE, pp. 4628–4635.
  • [NDI11] Newcombe R. A., Davison A. J., Izadi S., Kohli P., Hilliges O., Shotton J., Molyneaux D., Hodges S., Kim D., Fitzgibbon A.: KinectFusion: Real-time dense surface mapping and tracking. In Proc. IEEE Int. Symp. on Mixed and Augmented Reality (2011), pp. 127–136.
  • [NKP19] Naseer M., Khan S., Porikli F.: Indoor scene understanding in 2.5/3d for autonomous agents: A survey. IEEE Access 7 (2019), 1859–1887.
  • [NXS12] Nan L., Xie K., Sharf A.: A search-classify approach for cluttered indoor scene understanding. ACM Trans. on Graph. (SIGGRAPH Asia) 31, 6 (2012), 137:1–137:10.
  • [NZIS13] Nießner M., Zollhöfer M., Izadi S., Stamminger M.: Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. on Graph. (SIGGRAPH Asia) 32, 6 (2013), 169.
  • [PBSS16] Potthast C., Breitenmoser A., Sha F., Sukhatme G. S.: Active multi-view object recognition: A unifying view on online feature selection and view planning. Robotics and Autonomous Systems 84 (2016), 31–47.
  • [QLJ17] Qi X., Liao R., Jia J., Fidler S., Urtasun R.: 3d graph neural networks for rgbd semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 5199–5208.
  • [SHB12] Socher R., Huval B., Bath B., Manning C. D., Ng A. Y.: Convolutional-recursive deep learning for 3d object classification. In Advances in neural information processing systems (2012), pp. 656–664.
  • [SLX15] Song S., Lichtenberg S. P., Xiao J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 567–576.
  • [SX16] Song S., Xiao J.: Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proc. CVPR (2016).
  • [SYZ17] Song S., Yu F., Zeng A., Chang A. X., Savva M., Funkhouser T.: Semantic scene completion from a single depth image.
  • [SZX15] Song S., Zhang L., Xiao J.: Robot in a room: Toward perfect object recognition in closed environments. arXiv preprint arXiv:1507.02703 (2015).
  • [WLSM15] Whelan T., Leutenegger S., Salas-Moreno R. F., Glocker B., Davison A. J.: Elasticfusion: Dense slam without a pose graph. In Proc. Robotics: Science and Systems (2015).
  • [WSK15] Wu Z., Song S., Khosla A., Yu F., Zhang L., Tang X., Xiao J.: 3D ShapeNets: A deep representation for volumetric shapes. In Proc. CVPR (2015), pp. 1912–1920.
  • [XHS15] Xu K., Huang H., Shi Y., Li H., Long P., Caichen J., Sun W., Chen B.: Autoscanning for coupled scene reconstruction and proactive object analysis. ACM Trans. on Graph. 34, 6 (2015), 177.
  • [XMZ14] Xu K., Ma R., Zhang H., Zhu C., Shamir A., Cohen-Or D., Huang H.: Organizing heterogeneous scene collection through contextual focal points. ACM Trans. on Graph. (SIGGRAPH) 33, 4 (2014), 35:1–35:12.
  • [XSZ16] Xu K., Shi Y., Zheng L., Zhang J., Liu M., Huang H., Su H., Cohen-Or D., Chen B.: 3D attention-driven depth acquisition for object identification. ACM Trans. on Graph. 35, 6 (2016), 238.
  • [XZY17] Xu K., Zheng L., Yan Z., Yan G., Zhang E., Nießner M., Deussen O., Cohen-Or D., Huang H.: Autonomous reconstruction of unknown indoor scenes guided by time-varying tensor fields. ACM Transactions on Graphics (TOG) (2017).
  • [YLL18] Ye X., Lin Z., Li H., Zheng S., Yang Y.: Active object perceiver: Recognition-guided policy learning for object searching on mobile robots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2018), IEEE, pp. 6857–6863.
  • [ZXTZ14] Zhang Y., Xu W., Tong Y., Zhou K.: Online structure analysis for real-time indoor scene reconstruction. ACM Trans. on Graph. 34, 5 (2014), 159.