A Robust 3D-2D Interactive Tool for Scene Segmentation and Annotation

10/19/2016 · by Duc Thanh Nguyen et al. · Deakin University and Singapore University of Technology and Design

Recent advances in 3D acquisition devices have enabled large-scale capture of 3D scene data. Such data, if completely and well annotated, can serve as useful ingredients for a wide spectrum of computer vision and graphics tasks such as data-driven modeling, scene understanding, and object detection and recognition. However, annotating a vast amount of 3D scene data remains challenging due to the lack of an effective tool and/or the complexity of 3D scenes (e.g. clutter, varying illumination conditions). This paper aims to build a robust annotation tool that effectively and conveniently enables the segmentation and annotation of massive 3D data. Our tool works by coupling 2D and 3D information via an interactive framework, through which users can provide high-level semantic annotation for objects. We have experimented with our tool and found that a typical indoor scene could be well segmented and annotated in less than 30 minutes using the tool, as opposed to a few hours if done manually. Along with the tool, we created a dataset of over a hundred 3D scenes with complete annotations produced using our tool. The tool and dataset are available at www.scenenn.net.




1 Introduction

Fig. 1:

High-quality 3D scene data has become increasingly available thanks to the growing popularity of consumer-grade depth sensors and tremendous progress in 3D scene reconstruction research [1], [2], [3], [4]. Such 3D data, if fully and well annotated, would be useful for powering different computer vision and graphics tasks such as scene understanding [5], [6], object detection and recognition [7], and functionality reasoning in 3D space [8].

Scene segmentation and annotation refer to separating an input scene into meaningful objects. For example, the scene in Fig. 1 can be segmented and annotated into chairs, a table, etc. The literature has shown the crucial role of 2D annotation tools (e.g. [9]) and 2D image datasets (e.g. [10], [11], [12]) in various computer vision problems such as semantic segmentation and object detection and recognition [13], [14]. This inspires us to pursue such tasks on 3D scene data. However, segmentation and annotation of 3D scenes require much more effort due to the large scale of the 3D data (e.g. there are millions of 3D points in a reconstructed scene). Development of a robust tool to facilitate the segmentation and annotation of 3D scenes is thus in demand, and is the aim of this work. To this end, we make the following contributions:

  • We propose an interactive framework that effectively couples the geometric and appearance information from multi-view RGB data. The framework is able to automatically perform 3D scene segmentation.

  • Our tool is equipped with a 2D segmentation algorithm that builds on the 3D segmentation results.

  • We develop assistive user-interactive operations that allow users to flexibly manipulate scenes and objects in both 3D and 2D. Users co-operate with the tool by refining the segmentation and providing semantic annotation.

  • To further assist users in annotation, we propose an object search algorithm which automatically segments and annotates repetitive objects defined by users.

  • We create a dataset with more than a hundred scenes. All the scenes are fully segmented and annotated using our tool. We refer readers to [15] for more details and proof-of-concept applications using the dataset.

Compared with existing works on RGB-D segmentation and annotation (e.g. [16], [17]), our tool offers several advantages. First, in our tool, segmentation and annotation are centralized in 3D, which frees users from manipulating thousands of images. Second, the tool can take either RGB-D images or the triangular mesh of a scene as input. This enables the tool to handle meshes reconstructed from either RGB-D images [18] or structure-from-motion [19] in a unified framework.

We note that interactive annotation has also been exploited in a few concurrent works, e.g. SemanticPaint [20] and Semantic Paintbrush [21]. However, those systems can only handle scenes that are partially captured at hand and contain only a few objects to be annotated. In contrast, our annotation tool handles complete 3D scenes and is able to work with pre-captured data. Our collected scenes are more complex, with a greater variety of objects. Moreover, SemanticPaint [20] requires physical touching for interaction and is hence limited to touchable objects, whereas objects at different scales can be annotated using our tool. In addition, our tool supports 2D segmentation, which is available in neither SemanticPaint [20] nor Semantic Paintbrush [21].

2 Related Work

RGB-D Segmentation.

A common approach for scene segmentation is to perform the segmentation on RGB-D images and use object classifiers to label the segmentation results. Examples of this approach can be found in [16], [17]. The spatial relationships between objects can also be exploited to infer the scene labels. For example, Jia et al. [22] used object layout rules for scene labeling. The spatial relationship between objects was modeled by a conditional random field (CRF) in [23], [24] and by a directed graph in [25].

In general, the above methods make use of RGB-D images captured from a single viewpoint of a 3D scene and thus can only partially annotate the scene. Compared with those methods, our tool can achieve more complete segmentation results by operating on the 3D models of the scene and its objects.

From 2D to 3D Labeling. Compared with 2D labels, 3D labels are often desired as they provide a more comprehensive understanding of the real world. 3D labels can be propagated by back-projecting 2D labels from the image domain to 3D space. For example, Wang et al. [26] used the labels provided in ImageNet [10] to infer 3D labels. In [3], 2D labels were obtained by drawing polygons.

Labeling directly on images is time consuming: typically, a few thousand images need to be handled. It is possible to perform matching among the images to propagate annotations from one image to another, e.g. [3], but this process is less reliable.

Fig. 2:

3D Object Templates. 3D object templates can be used to segment 3D scenes. The templates can be organized in holistic models, e.g., [27], [28], [29], [30], or part-based models, e.g. [31]. The segmentation can be performed on 3D point clouds, e.g. [27], [29], [31], or 3D patches, e.g. [30], [28], [32].

Generally speaking, the above techniques require the template models to be known in advance. They do not fit our interactive system well, in which templates can be provided on the fly by users. In our tool, we propose to use shape matching to help users in the segmentation and annotation task. Shape matching does not require off-line training and has proved to perform efficiently in practice.

Online Scene Understanding. Recently, several methods directly combine 3D reconstruction with annotation to achieve online scene understanding. For example, SemanticPaint [20] allowed users to annotate a scene by touching objects of interest. A CRF was then constructed to model each indicated object and used to parse the scene. SemanticPaint was extended to Semantic Paintbrush [21] for outdoor scene annotation by exploiting the longer range of a stereo rig.

In both [20] and [21], annotated objects and user-specified objects are assumed to have similar appearance (e.g. color). Furthermore, since the CRF models are built upon the reconstructed data, it is implicitly assumed that the reconstructed data is good enough that the CRF model constructed from the user-specified object and those of the objects to be annotated have consistent geometric representations. However, the point cloud of a scene is often incomplete, e.g. it contains holes. To deal with this issue, we describe the geometric shape of 3D objects using a shape descriptor which is robust to shape variation and occlusion. Experimental results show that our approach works well on noisy data (e.g. broken meshes) and robustly deals with shape deformation while being efficient for practical use.

Online interactive labeling is a trend for scene segmentation and annotation in which the scalability and convenience of the user interface are important factors. In [20], the annotation can only be done for objects that are physically touchable and hence is limited to partial scenes. In this sense, we believe that our tool would facilitate the creation of large-scale, complete, and semantically annotated 3D scene datasets.

3 System Overview

Fig. 2 shows the workflow of our tool. The tool includes four main stages: scene reconstruction, automatic 3D segmentation, interactive refinement and annotation, and 2D segmentation.

In the first stage (section 4), the system takes a sequence of RGB-D frames and reconstructs a triangular mesh, called 3D scene mesh. After the reconstruction, we compute and cache the correspondences between the 3D vertices in the reconstructed scene and the 2D pixels on all input frames. This allows seamless switching between segmentation in 3D and 2D in later steps.

In the second stage (section 5), the 3D scene mesh is automatically segmented. We start by clustering the mesh vertices into supervertices (section 5.1). Next, we group the supervertices into regions (section 5.2). We also cache the results of both steps for later use.

The third stage (section 6) of the system is designed for users to interact with the system. We design three segmentation refinement operations: merge, extract, and split. After refinement, users can make semantic annotation for objects in the scene.

To further assist users in segmentation and annotation of repetitive objects, we propose an algorithm to automatically search for repetitive objects specified by a template (section 7). We extend the well-known 2D shape context [33] to 3D space and apply shape matching to implement this functionality.

The fourth stage of the framework (section 8) is designed for segmentation of 2D frames. In this stage, we devise an algorithm that uses the 3D segmentation results to initialize the 2D segmentation and refines it via contour matching.

4 Scene Reconstruction

4.1 Geometry reconstruction

Several techniques have been developed for 3D scene reconstruction. For example, KinectFusion [34] applied frame-to-model alignment to fuse depth information and visualize 3D scenes in real time. However, KinectFusion tends to drift where depth maps are not accurately aligned, due to the accumulation of registration errors over time. Several attempts have been made to avoid drift and have led to significant improvements in high-quality 3D reconstruction. For example, Xiao et al. [3] added object constraints to correct misaligned reconstructions. Zhou et al. [35], [4] split the input frames into small chunks, each of which can be accurately reconstructed using a standard SLAM system like KinectFusion. An optimization is then performed to register all the chunks in the same coordinate frame. In robotics, SLAM systems also detect revisited places and trigger a loop closure constraint to enforce global consistency of camera poses.

In this work, we adopt the systems in [4], [18] to calculate camera poses. Given the camera poses, the triangular mesh of a scene is extracted using the marching cubes algorithm [1]. We also store the camera pose of each input frame for computing 3D-2D correspondences. The normal of each mesh vertex is given by the area-weighted average of the normals of its neighboring faces. We further smooth the resulting normals using a bilateral filter.
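The area-weighted normal computation can be sketched as follows. This is a minimal pure-Python illustration (function and variable names are our own, not from the released tool), and the bilateral smoothing pass is omitted:

```python
import math

def face_normal_and_area(p0, p1, p2):
    # Cross product of two triangle edges; its length is twice the face area.
    ux, uy, uz = (p1[i] - p0[i] for i in range(3))
    vx, vy, vz = (p2[i] - p0[i] for i in range(3))
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm == 0.0:
        return (0.0, 0.0, 0.0), 0.0
    return (nx / norm, ny / norm, nz / norm), 0.5 * norm

def vertex_normals(vertices, faces):
    # Area-weighted average of incident face normals, then renormalized.
    acc = [[0.0, 0.0, 0.0] for _ in vertices]
    for i, j, k in faces:
        n, a = face_normal_and_area(vertices[i], vertices[j], vertices[k])
        for idx in (i, j, k):
            for d in range(3):
                acc[idx][d] += a * n[d]
    out = []
    for v in acc:
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        out.append(tuple(c / norm for c in v))
    return out
```

For a flat patch of two coplanar triangles, every vertex normal coincides with the shared face normal, as expected.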

4.2 3D-2D Correspondence

Given the reconstructed 3D scene, we align the whole sequence of 2D frames with the 3D scene using the corresponding camera poses obtained from section 4.1. For each vertex, the normal is computed directly on the 3D mesh, and its color is estimated as the median of the colors of the corresponding pixels on the 2D frames.
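As an illustration, the per-vertex color estimate can be sketched like this; the pinhole projection and the row-major pose representation are assumptions made for the example, not the tool's actual API:

```python
import statistics

def project_vertex(v, R, t, fx, fy, cx, cy):
    """Project a world-space vertex into pixel coordinates of one frame.
    R (3x3, row-major tuples) and t map world -> camera; pinhole intrinsics
    (fx, fy, cx, cy) map camera space to pixels."""
    xc = R[0][0] * v[0] + R[0][1] * v[1] + R[0][2] * v[2] + t[0]
    yc = R[1][0] * v[0] + R[1][1] * v[1] + R[1][2] * v[2] + t[1]
    zc = R[2][0] * v[0] + R[2][1] * v[1] + R[2][2] * v[2] + t[2]
    if zc <= 0:
        return None  # vertex is behind the camera in this frame
    return fx * xc / zc + cx, fy * yc / zc + cy

def median_vertex_color(samples):
    """Per-channel median over the colors of the corresponding pixels,
    which is robust to a few misregistered frames."""
    return tuple(statistics.median(c[i] for c in samples) for i in range(3))
```

A vertex on the optical axis at depth 1 projects to the principal point, and outlier colors from a single bad frame do not shift the median.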

5 Segmentation in 3D

After the reconstruction, a scene mesh typically consists of millions of vertices. In this stage, those vertices are segmented into much fewer regions. To achieve this, we first divide the reconstructed scene into a number of so-called supervertices by applying a purely geometry-based segmentation method. We then merge the supervertices into larger regions by considering both surface normals and colors. We keep all the supervertices and regions for later use. In addition, the hierarchical structures of the regions, supervertices, and mesh vertices (e.g. list of mesh vertices composing a supervertex) are also recorded.

5.1 Graph-based Segmentation

We extend the efficient graph-based image segmentation algorithm of Felzenszwalb et al. [36] to 3D space. Specifically, the algorithm operates on a graph defined by the scene mesh, in which each node corresponds to a vertex in the mesh. Two nodes in the graph are linked by an edge if their corresponding mesh vertices belong to a common triangle. Let V be the set of vertices in the mesh. The edge connecting two vertices v_i and v_j is weighted as

w(v_i, v_j) = ||n_i − n_j||    (1)

where n_i and n_j are the unit normals of v_i and v_j respectively.
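A minimal sketch of this step might look as follows. It assumes the Euclidean distance between unit normals as the edge weight and pairs it with a Felzenszwalb-style union-find merge; all names are illustrative, and the minimum-region-size post-processing is omitted:

```python
import math

def edge_weight(n_i, n_j):
    # Assumed weight: Euclidean distance between the two unit normals.
    return math.dist(n_i, n_j)

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n  # max internal edge weight per component
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b, w):
        a, b = self.find(a), self.find(b)
        if a == b:
            return
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.internal[a] = max(self.internal[a], self.internal[b], w)

def segment_mesh(normals, edges, k=500.0):
    """Felzenszwalb-style merging on mesh edges sorted by weight: merge two
    components when the joining edge is no heavier than each component's
    internal variation plus k / component size. Returns a component id
    per vertex."""
    ds = DisjointSet(len(normals))
    weighted = sorted((edge_weight(normals[i], normals[j]), i, j)
                      for i, j in edges)
    for w, i, j in weighted:
        a, b = ds.find(i), ds.find(j)
        if a != b and w <= min(ds.internal[a] + k / ds.size[a],
                               ds.internal[b] + k / ds.size[b]):
            ds.union(a, b, w)
    return [ds.find(i) for i in range(len(normals))]
```

With a small threshold k, vertices with similar normals group together while a sharp crease (a heavy edge) separates components.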

The graph-based segmenter in [36] employs a number of parameters, including a smoothing factor used for noise filtering (of normals, in our case), a threshold representing the contrast between adjacent regions, and the minimum size of segmented regions. In our implementation, these parameters were set to 0.5, 500, and 20 respectively. However, we also expose these parameters to users for customization.

The graph-based segmentation algorithm results in a set of supervertices S = {s_i}. Each supervertex is a group of geometrically homogeneous vertices with similar surface normals. The bottom left image in Fig. 2 shows an example of the supervertices. More examples can be found in Fig. 10 and Fig. 11.

5.2 MRF-based Segmentation

The graph-based segmentation often produces a large number (e.g. a few thousand) of supervertices, which would require considerable annotation effort. To reduce this burden, the supervertices are clustered into regions by optimizing an MRF model. In particular, for each supervertex s_i, the color and normal of s_i, denoted as c_i and n_i, are computed as the means of the color values and normals of all vertices in s_i. Each supervertex is then represented by a node in an MRF. Two nodes s_i and s_j are directly connected if s_i and s_j share some common boundary (i.e. s_i and s_j are adjacent supervertices). Let x_i be the label of s_i; the unary potentials are defined as

θ_i(x_i) = −log N(c_i; μ_c^{x_i}, Σ_c^{x_i}) − log N(n_i; μ_n^{x_i}, Σ_n^{x_i})    (2)

where N(·; μ_c^{x_i}, Σ_c^{x_i}) and N(·; μ_n^{x_i}, Σ_n^{x_i}) are the Gaussians of the color values and normals of the label class x_i, and μ and Σ are the mean and covariance matrix of each Gaussian.

The pairwise potentials are defined as the Potts model [37]:

θ_{ij}(x_i, x_j) = 0 if x_i = x_j, and 1 otherwise.    (3)
Let X = {x_i} be the set of labels of the supervertices. The optimal labeling X* is determined by

X* = argmin_X Σ_i θ_i(x_i) + λ Σ_{(i,j)} θ_{ij}(x_i, x_j)    (4)

where λ is a weight factor set to 0.5 in our implementation.

The optimization problem in (4) is solved using the method in [37]. In our implementation, the number of labels was initialized to the number of supervertices; each supervertex was assigned to a different label. Fig. 2 (bottom) shows the result of the MRF-based segmentation. More results of this step are presented in Fig. 10 and Fig. 11.
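As a rough illustration of the objective being minimized, the following sketch evaluates the unary-plus-Potts energy of (4) and runs iterated conditional modes (ICM), a simple stand-in for the actual solver in [37]; the dense `unary[i][l]` cost table is an assumption of the example:

```python
def potts(xi, xj):
    # Pairwise Potts term: zero cost for equal labels, unit cost otherwise.
    return 0.0 if xi == xj else 1.0

def energy(labels, unary, edges, lam=0.5):
    """Total MRF energy: data terms plus weighted Potts pairwise terms.
    unary[i][l] is the cost of assigning label l to supervertex i."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += lam * sum(potts(labels[i], labels[j]) for i, j in edges)
    return e

def icm(labels, unary, edges, lam=0.5, iters=10):
    """Iterated conditional modes: greedily relabel each node to minimize
    its local energy until no node changes."""
    nbrs = {i: [] for i in range(len(labels))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    labels = list(labels)
    for _ in range(iters):
        changed = False
        for i in range(len(labels)):
            def local(l):
                return unary[i][l] + lam * sum(potts(l, labels[j])
                                               for j in nbrs[i])
            best = min(range(len(unary[i])), key=local)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

On a small chain, a node with a weak preference for a different label is pulled to its neighbors' label when the pairwise penalty outweighs the data term.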

6 Segmentation Refinement and Annotation in 3D

The automatic segmentation stage can produce over- and under-segmented regions. To resolve these issues, we design three operations: merge, extract, and split.

Merge. This operation is used to resolve over-segmentation. In particular, users identify over-segmented regions that need to be grouped by stroking on them. The merge operation is illustrated in the first row of Fig. 3.

Extract. This operation is designed to handle under-segmentation. In particular, users first select an under-segmented region, and the supervertices composing that region are retrieved. Users can then select a few supervertices and use the merge operation to group them into a new region. Note that the supervertices are not recomputed; instead, they are retrieved from the results cached in the graph-based segmentation step. The second row of Fig. 3 shows the extract operation.


In a few rare cases, the MRF-based segmentation may perform differently on different regions. This is probably because of variation in the geometric shape and appearance of objects. For example, a scene may contain chairs of a single uniform color as well as chairs that each combine multiple colors. Therefore, a single setting of the parameters in the MRF-based segmentation may not adapt to all objects.

Fig. 3:
Fig. 4:

To address this issue, we design a split operation enabling user-guided MRF-based segmentation. Specifically, users first select an under-segmented region by stroking on that region. The MRF-based segmentation is then invoked on the selected region with a small value of λ (see (4)) to generate finer-grained regions. We then enforce a constraint such that the starting and ending points of the stroke belong to two different regions. For example, assume that x_s and x_e are the labels of the two supervertices that respectively contain the starting and ending points of the stroke. To bias the objective function in (4), the pairwise term θ_{se}(x_s, x_e) in (3) is set to 0 when x_s ≠ x_e, and to a large value otherwise. By doing so, the optimization in (4) favors the case x_s ≠ x_e. In other words, the supervertices at the starting and ending points are driven to separate regions. Note that the MRF-based segmentation is only re-executed on the selected region; the split operation is therefore fast and does not hinder user interaction. The third row of Fig. 3 illustrates the split operation.

Through experiments we have found that, most of the time, users perform merge and extract operations. The split operation is only used when the extract operation cannot handle severe under-segmentation, but such cases are not common in practice. When all 3D segmented regions have been refined, users can annotate the regions by providing the object type, e.g. coffee table, sofa chair. Fig. 4 shows an example of using our tool for annotation. Note that users are free to navigate the scene in both 3D and 2D space.

7 Object Search

There may exist multiple instances of an object class in a scene, e.g. the nine chairs in Fig. 6. To support labeling and annotating repetitive objects, users can define a template by selecting an existing region, or multiple regions, composing the template. Those regions are the results of the MRF-based segmentation or user refinement. Given the user-defined template, our system automatically searches for objects that are similar to the template. Note that a repetitive object is not necessarily present as a single region; instead, it may be composed of multiple regions. For example, each chair in Fig. 6(a) consists of several regions such as the back, seat, and legs. Once a group of regions is found to match the template well, the regions are merged into a single object and recommended to users for verification. We extend the 2D shape context proposed in [33] to describe 3D objects (section 7.1). Matching objects with the template is performed by comparing shape context descriptors (section 7.2). The object search is then built upon the sliding-window object detection approach [38] (section 7.3).

7.1 Shape Context

Fig. 5:

Shape context was proposed by Belongie et al. [33] as a 2D shape descriptor and is well known for many desirable properties: it is discriminative, robust to shape deformation and transformation, and less sensitive to noise and partial occlusions. Those properties fit our needs well for several reasons. First, reconstructed scene meshes can be incomplete and contain noisy surfaces. Second, occlusions may also appear due to the lack of images completely covering objects. Third, the tool is expected to adapt to variation in object shapes, e.g. chairs with and without arms.

In our work, a 3D object is represented by a set of vertices P = {p_i} obtained from the 3D reconstruction step. For each vertex p_i, the shape context of p_i is denoted as h_i and represented by the histogram of the relative locations of the other vertices p_j, j ≠ i, with respect to p_i. Let u_{ij} = p_j − p_i. The relative location of a vertex p_j to p_i is encoded by the length ||u_{ij}|| and the spherical coordinates (θ_{ij}, φ_{ij}) of u_{ij}. In our implementation, the lengths were quantized into 5 levels. To make the shape context more sensitive to local deformations, the lengths were quantized in a log-scale space. The spherical angles were quantized uniformly into 6 discrete values each. Fig. 5 illustrates the 3D shape context.
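A compact sketch of the 3D shape context histogram might look as follows; the exact radial range and bin boundaries are assumptions (the paper fixes only 5 log-scale length levels and 6 angle values), and the descriptor is returned as a sparse dictionary for brevity:

```python
import math
from collections import Counter

def shape_context(points, i, n_r=5, n_theta=6, n_phi=6):
    """Log-scale radial bins x uniform spherical-angle bins, relative to
    point i. Returns a sparse histogram {(r_bin, theta_bin, phi_bin): count}."""
    pi = points[i]
    vecs = [tuple(q[d] - pi[d] for d in range(3))
            for j, q in enumerate(points) if j != i]
    lengths = [math.sqrt(sum(c * c for c in v)) for v in vecs]
    mean_len = sum(lengths) / len(lengths)  # scale normalization
    hist = Counter()
    for v, l in zip(vecs, lengths):
        if l == 0:
            continue
        r = l / mean_len
        # Log-scale radial bin over an assumed [1/8, 8] x mean-length range.
        r_bin = min(n_r - 1, max(0, int((math.log2(r) + 3) / 6 * n_r)))
        theta = math.acos(max(-1.0, min(1.0, v[2] / l)))  # polar angle
        phi = math.atan2(v[1], v[0]) + math.pi            # azimuth in [0, 2*pi)
        t_bin = min(n_theta - 1, int(theta / math.pi * n_theta))
        p_bin = min(n_phi - 1, int(phi / (2 * math.pi) * n_phi))
        hist[(r_bin, t_bin, p_bin)] += 1
    return hist
```

Every other vertex contributes exactly one count, so the histogram mass equals the number of remaining points.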

The shape context descriptor is made scale-invariant by normalizing the lengths ||u_{ij}|| by the mean length of all vectors. To make the shape context rotation invariant, Kortgen et al. [39] computed the spherical coordinates relative to the eigenvectors of the covariance matrix of all vertices. However, the eigenvectors may not be computed reliably for shapes having no dominant orientations, e.g. rounded objects. In addition, the eigenvectors are only informative when the shape is complete, while our scene meshes may be incomplete. To overcome this issue, we establish a local coordinate frame at each vertex on a shape using its normal and tangent vector, where the tangent vector of a vertex is the one connecting it to the centroid of the shape. We have found this approach to work more reliably.

Since a reconstructed scene often contains millions of vertices, prior to applying the object search we uniformly subsample the scene, which greatly reduces the number of vertices per object.

7.2 Shape Matching

Comparing (matching) two given shapes P and Q is to maximize the correspondences between pairs of vertices on the two shapes, i.e. to minimize the deformation of the two shapes in a point-wise fashion. The deformation cost between two vertices p_i ∈ P and q_j ∈ Q is measured by the chi-squared distance between the two corresponding shape context descriptors h_i and h_j extracted at p_i and q_j as follows:

C(p_i, q_j) = (1/2) Σ_{k=1}^{K} [h_i(k) − h_j(k)]² / [h_i(k) + h_j(k)]    (5)

where K is the dimension (i.e. the number of bins) of the descriptor and h_i(k) is the value of h_i at the k-th bin.

Given the deformation cost of every pair of vertices on two shapes P and Q, shape matching can be solved using the shortest augmenting path algorithm [40]. To adapt the matching algorithm to shapes with different numbers of vertices, "dummy" vertices are added. This also makes the matching method robust to noisy data and partial occlusions. Formally, the deformation cost between two shapes P and Q is computed as

D(P, Q) = (1/|P'|) Σ_{p_i ∈ P'} C(p_i, π(p_i))    (6)

where P' is identical to P, or augmented from P by adding dummy vertices, and π(p_i) is the matching vertex of p_i determined using [40].
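The chi-squared cost of (5) and the dummy-augmented matching can be sketched as below. Note that the greedy assignment is a simplification of the shortest augmenting path algorithm of [40], used only to keep the example short; unmatched vertices play the role of dummy correspondences:

```python
def chi2_cost(h1, h2):
    """Chi-squared distance between two sparse histograms, as in (5)."""
    total = 0.0
    for k in set(h1) | set(h2):
        a, b = h1.get(k, 0.0), h2.get(k, 0.0)
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def greedy_match(costs, dummy_cost=1.0):
    """Greedy one-to-one assignment over a cost matrix; rows left unmatched
    pay dummy_cost, mimicking dummy vertices. Returns the matching and the
    average per-vertex deformation cost."""
    pairs = sorted((c, i, j) for i, row in enumerate(costs)
                   for j, c in enumerate(row))
    used_i, used_j, match = set(), set(), {}
    for c, i, j in pairs:
        if i not in used_i and j not in used_j and c < dummy_cost:
            match[i] = j
            used_i.add(i)
            used_j.add(j)
    total = sum(costs[i][j] for i, j in match.items())
    total += dummy_cost * (len(costs) - len(match))  # dummy assignments
    return match, total / max(1, len(costs))
```

On a cost matrix with a clear diagonal structure, the greedy pass recovers the obvious correspondence.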

To further improve the matching, we also consider how well the two matching shapes are aligned. In particular, we first align P to Q using a rigid transformation. This rigid transformation is represented by a matrix T and estimated using the RANSAC algorithm, which randomly picks three pairs of correspondences and determines the rotation and translation [41]. We then compute an alignment error

e(P, Q) = (1/|P|) Σ_{p_i ∈ P} min(||T p_i − π(p_i)||, κ)    (7)

and, similarly, e(Q, P), where T is the rigid transformation matrix and κ is a large value (in meters) used to penalize misalignments.

A match is confirmed if: (i) the deformation cost D(P, Q) is below a threshold and (ii) both alignment errors e(P, Q) and e(Q, P) are below a threshold. These thresholds and κ were set empirically in our experiments. We have found that the object search method was not too sensitive to the parameter settings, while the chosen settings achieved the best performance.

Fig. 6:

7.3 Searching

Object search can be performed based on the sliding-window approach [38]. Specifically, we take the 3D bounding box of the template and use it as the window to scan a 3D scene. At each location in the scene, all regions that intersect the window are considered for their possibility to be part of a matching object. However, it would be intractable to consider every possible combination of all regions. To deal with this issue, we propose a greedy algorithm that operates iteratively by adding and removing regions.

function GrowShrink(R, W, T)
Input: R: set of regions to examine,
W: window,
T: user-defined template
Output: O: best matching object
        O ← R
        for t = 1 to n_iter do
             // grow
              for each region r ∈ R \ O do
                   if D(O ∪ {r}, T) < D(O, T) and e(O ∪ {r}, T) < e(O, T) then
                        O ← O ∪ {r}
                   end if
             end for
             // shrink
              for each region r ∈ O do
                   if D(O \ {r}, T) < D(O, T) and e(O \ {r}, T) < e(O, T) then
                        O ← O \ {r}
                   end if
             end for
        end for
        return O
Algorithm 1: Grow-shrink procedure. D and e are the matching cost and alignment error defined in (6) and (7).

The general idea is as follows. Let R be the set of regions that intersect the window W, i.e. the 3D bounding box of the template. For a region r not in the current candidate object O, we verify whether the object made by O ∪ {r} is more similar to the user-defined template than O is. Similarly, for every region r ∈ O we also verify the object made by O \ {r}. These adding and removing steps are performed interchangeably over a small number of iterations until the best matching result (i.e. a group of regions) is found. This procedure is called grow-shrink and is described in Algorithm 1.
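The grow-shrink procedure can be sketched in a few lines. Here a single `cost` callback stands in for the combined deformation cost and alignment error of (6) and (7), and regions are abstracted as hashable ids; all names are illustrative:

```python
def grow_shrink(regions, candidate, template, cost, n_iter=3):
    """Greedy grow-shrink: add or remove one region at a time whenever doing
    so lowers the matching cost against the template.
    cost(obj, template) -> float stands in for the combined score of (6)-(7)."""
    obj = set(candidate)
    for _ in range(n_iter):
        for r in regions:                     # grow
            if r not in obj and cost(obj | {r}, template) < cost(obj, template):
                obj.add(r)
        for r in list(obj):                   # shrink
            if len(obj) > 1 and cost(obj - {r}, template) < cost(obj, template):
                obj.remove(r)
    return obj
```

With a toy cost (size of the symmetric difference with the template), the procedure grows the missing parts and sheds spurious ones.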

In our implementation, the spatial strides along the x, y, and z directions of the window were set relative to the size of W. The number of iterations n_iter in Algorithm 1 was set to a small constant, which resulted in satisfactory accuracy and efficiency.

Since a region may be contained in more than one window, it may be verified multiple times in multiple groups of regions. To avoid this, if an object candidate is found in a window, its regions will not be considered in any other objects and any other windows. Fig. 6 illustrates the robustness of the object search in localizing repetitive objects under severe conditions (e.g. objects with incomplete shape).

The search procedure may miss some objects. To handle such cases, we design an operation called guided merge. In particular, after defining the template, users simply select one of the regions of a target object that is missed by the object search. The grow-shrink procedure is then applied on the selected region to seek a better match with the template. Fig. 7 shows an example of the guided merge operation.

Fig. 7:

8 Segmentation on 2D

Segmentation on 2D can be done by projecting regions in 3D space onto 2D frames. However, the projected regions may not well align with the true objects on 2D frames (see Fig. 8). There are several reasons for this issue. For example, the depth and color images used to reconstruct a scene might not be exactly aligned at object boundaries; the camera intrinsics are from factory settings and not well calibrated; camera registration during reconstruction exhibits drift.

Fig. 8:
Fig. 9:

To overcome this issue, we propose an alignment algorithm which aims to fit the boundaries of projected regions to the true boundaries on 2D frames. The true boundaries on a 2D frame can be extracted using an edge detector (e.g. the Canny edge detector [42]). Let E denote the set of edge points on the edge map of a 2D frame, and let C be the set of contour points of a projected object on that frame. C is then ordered using the Moore neighbor tracing algorithm [43]. The ordering step expresses the contour alignment problem in a form to which dynamic programming can be applied for an efficient implementation.

At each contour point c_i ∈ C, we consider a small square window centered at c_i (sized relative to the image dimensions). We then extract the histogram of the orientations of the vectors c_j − c_i for points c_j in the window. The orientations are uniformly quantized into 16 bins. We also perform this operation for the edge points in E. The dissimilarity d(c_i, e) between the two local shapes at a contour point c_i and an edge point e is computed as a chi-squared distance (similarly to (5)).

We also consider the continuity and smoothness of contours. In particular, the continuity between two adjacent points c_i and c_{i+1} is defined as the distance between their mapped edge points. The smoothness of a fragment including three consecutive points c_{i−1}, c_i, c_{i+1} is computed as 1 − cos(u, v), where u and v denote the vectors connecting the mapped point of c_{i−1} to that of c_i, and the mapped point of c_i to that of c_{i+1}, respectively, and cos(u, v) is the cosine of the angle formed by these two vectors.

Alignment of C to E is to identify a mapping function f that maps a contour point c_i ∈ C to an edge point f(c_i) ∈ E so as to solve

f* = argmin_f [ D(f) + α G(f) + β S(f) ]    (8)

where D(f), G(f), and S(f) are the total local-shape dissimilarity, continuity, and smoothness costs of the mapped contour.
The optimization problem in (8) can be considered as a bipartite graph matching problem [40]. However, since C is ordered, the optimization can be solved efficiently using dynamic programming [44]. In particular, denoting d_i = d(c_i, f(c_i)), g_i the continuity term between f(c_i) and f(c_{i+1}), and s_i the smoothness term over f(c_{i−1}), f(c_i), f(c_{i+1}), the objective function in (8) can be rewritten as

f* = argmin_f Σ_i (d_i + α g_i + β s_i)    (10)

where α and β are user parameters. We tried various values for α and β and selected the setting that most often produced good results.
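The dynamic program can be sketched as follows for the first-order case (local dissimilarity plus the continuity term); the second-order smoothness term is omitted for brevity, and all names are illustrative:

```python
import math

def align_contour(contour_costs, candidates, alpha=1.0):
    """First-order DP over the ordered contour: pick one candidate edge point
    per contour point, trading off local shape cost against continuity
    (distance between consecutive picks).
    contour_costs[i][k]: shape dissimilarity of contour point i vs its k-th
    candidate edge point; candidates[i][k]: that candidate's (x, y)."""
    n = len(candidates)
    best = [list(contour_costs[0])]  # best[i][k]: min cost ending at pick k
    back = []                        # backpointers for path recovery
    for i in range(1, n):
        row, brow = [], []
        for k, q in enumerate(candidates[i]):
            choices = [best[i - 1][m] + alpha * math.dist(candidates[i - 1][m], q)
                       for m in range(len(candidates[i - 1]))]
            m = min(range(len(choices)), key=choices.__getitem__)
            row.append(choices[m] + contour_costs[i][k])
            brow.append(m)
        best.append(row)
        back.append(brow)
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for brow in reversed(back):
        k = brow[k]
        path.append(k)
    path.reverse()
    return path, best[-1][path[-1]]
```

On two contour points whose nearby candidates have low shape cost, the DP picks the close, cheap candidates rather than a distant pair.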

In (10), for each contour point c_i, all edge points are verified for a match. However, this exhaustive search is not necessary since the misalignment is bounded in practice. To save computational cost, we limit the search space for each contour point by considering only its nearest edge points whose distance to c_i is less than a distance d_max. In our experiments, d_max was set to a small fraction of the maximum image dimension, and the number of nearest edge points considered was set to 30. Fig. 8(c) shows an example of contour alignment obtained by optimizing (10) with dynamic programming.

We have also verified the contribution of the continuity and smoothness cues. Fig. 9 shows the results when the cues are used individually and in combination. The results show that, when all the cues are taken into account, the contours are mostly well aligned with the true object boundaries. Note that the seat of the green chair is not correctly recovered; we found this is because the Canny detector missed important edges on the boundaries of the chair. Users are also free to edit the alignment results.

9 Experiments

We present the dataset on which experiments were conducted in section 9.1. We evaluate the 3D segmentation in section 9.2. The object search is evaluated in section 9.3. Experimental results of the 2D segmentation are finally presented in section 9.4.

9.1 Dataset

Fig. 10:
Fig. 11:
Scene | #Vertices | Graph-based: #Supervertices, OCE | MRF-based: #Regions, OCE | User refined: #Labels, #Objects | Interactive time (minutes)
copyroom | 1,309,421 | 1,996, 0.92 | 347, 0.73 | 157, 15 | 19
lounge | 1,597,553 | 2,554, 0.97 | 506, 0.93 | 53, 12 | 16
hotel | 3,572,776 | 13,839, 0.98 | 1,433, 0.88 | 96, 21 | 27
dorm | 1,823,483 | 3,276, 0.97 | 363, 0.78 | 75, 10 | 15
kitchen | 2,557,593 | 4,640, 0.97 | 470, 0.85 | 75, 24 | 23
office | 2,349,679 | 4,026, 0.97 | 422, 0.84 | 69, 19 | 24
Our scenes | 1,450,748 | 2,498, 0.93 | 481, 0.77 | 179, 19 | 30
TABLE I: Comparison of the graph-based and MRF-based segmentation. For our captured scenes, the figures are averages over all scenes. Note that for the user-refined results, the number of annotated objects is smaller than the number of labels (i.e. segments); this is because annotation was done only for objects that are common in practice.

We created a dataset consisting of over 100 scenes. The dataset includes six scenes from publicly available datasets: copyroom and lounge from the Stanford dataset [35], hotel and dorm from SUN3D [3], and the kitchen and office sequences from the Microsoft dataset [2]. The Stanford and SUN3D datasets provide registered RGB and depth image pairs, and all of these datasets include camera pose data.

In addition to these existing scenes, we collected 100 scenes using an Asus Xtion and a Microsoft Kinect v2. Our scenes were captured on the campuses of the University of Massachusetts Boston and the Singapore University of Technology and Design, in various locations such as lecture rooms, theatres, the university hall, the library, computer labs, dormitories, etc. All the scenes were then fully segmented and annotated using our tool. The dataset also includes the camera pose information. Fig. 10 and Fig. 11 show the six scenes collected from the public datasets and several of our collected scenes.

9.2 Evaluation of 3D Segmentation

We evaluated the impact of the graph-based and MRF-based segmentation on our dataset, treating the annotated results obtained with our tool as the ground-truth. To measure segmentation performance, we extended the object-level consistency error (OCE), an image segmentation evaluation metric proposed in [45], to 3D vertices. Essentially, the OCE reflects how well the pixels/vertices of segmented regions coincide with those of ground-truth regions. As indicated in [45], compared with other segmentation evaluation metrics (e.g. the global and local consistency errors in [46]), the OCE accounts for both over- and under-segmentation errors in a single measure. In addition, the OCE can quantify the accuracy of multi-object segmentation and thus fits our evaluation purpose well.
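The exact OCE formulation is given in [45]; as an illustration of the underlying idea only (a single size-weighted measure that penalizes both over- and under-segmentation), the following is a deliberately simplified variant, not the metric used in our experiments. The region arguments are hypothetical sets of vertex indices:

```python
def directional_error(src_regions, dst_regions, n_total):
    """Size-weighted error of matching each source region to its
    best-overlapping (highest-IoU) destination region."""
    err = 0.0
    for s in src_regions:
        best = max((len(s & d) / len(s | d) for d in dst_regions),
                   default=0.0)
        err += (len(s) / n_total) * (1.0 - best)
    return err

def consistency_error(segments, ground_truth, n_total):
    """Simplified symmetric consistency error: averaging the two
    directional errors penalizes over-segmentation (many segments per
    ground-truth region) and under-segmentation (one segment covering
    several ground-truth regions) alike."""
    return 0.5 * (directional_error(segments, ground_truth, n_total)
                  + directional_error(ground_truth, segments, n_total))
```

A perfect segmentation yields an error of 0, while splitting or merging regions raises the score toward 1.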

Table I summarizes the OCE of the graph-based and MRF-based segmentation. As shown in the table, the MRF-based segmentation significantly improves the accuracy over the graph-based segmentation. The reduction from millions of 3D vertices to a few thousand supervertices and then a few hundred regions is also noticeable. However, the experimental results also show that automatically generated segmentations still do not reach the quality produced by humans, so user interaction remains necessary. There are two reasons for this. First, both the graph-based and MRF-based segmentation aim to partition a 3D scene into homogeneous regions/surfaces rather than semantic objects. Second, semantic segmentation done by users is subjective. For example, one may consider a pot and the plant growing in it as two separate objects or as a single object.

After user interaction, the number of final labels is typically fewer than a hundred, and the number of semantic objects is around 10 to 20 in most cases. Note that these two numbers are not identical, because some labels have no well-defined semantics, e.g. miscellaneous items on a table or small segments that appear as noise in the 3D reconstruction.

We also measured the time required for user interaction with our tool; this is reported in the last column of Table I. As shown in the table, with the assistance of the tool, complex 3D scenes (with millions of vertices) could be completely segmented and annotated in less than 30 minutes, as opposed to a few hours if done manually. Note that the interaction time depends on the user's experience. Several results of our tool on the public datasets and our collected dataset are shown in Fig. 10 and Fig. 11.

Through experiments we found that, although our tool was able to handle most reconstructed scenes in reasonable processing time, it failed on a few locally rough surface areas, e.g. the outer boundaries of the 3D mesh and pothole-like areas caused by loop closure. Repairing broken surfaces and completing missing object parts will be our future work.

9.3 Evaluation of Object Search

To evaluate the object search functionality, we collected a set of 45 objects from our dataset. The objects were selected to be semantically meaningful, common in practice, and discriminative in shape. For example, drawers of cabinets were not selected since they present as flat surfaces, which can easily be found in many other structures, e.g. walls and pictures. For each scene and each object class (e.g. chair), each object in the class was used in turn as the template while the remaining objects of the same class were considered the ground-truth repetitions. The object search was then applied to find repetitive objects given the template.

We used precision, recall, and F-measure to evaluate the performance of the object search. The intersection over union (IoU) metric proposed in [11] for object detection was used as the criterion to determine true detections and false alarms. However, instead of computing the IoU on the bounding boxes of objects as in [11], we computed the IoU at point level (i.e. over the 3D vertices of the mesh), because our aim is not only to localize repetitive objects but also to segment them. In particular, an object O (a set of vertices) formed by the object search procedure is considered a true detection if there exists an annotated object G in the ground-truth such that

|O ∩ G| / |O ∪ G| > 0.5,

where |·| denotes the area (the number of vertices); the threshold 0.5 is commonly used in object detection evaluation (e.g. [11]).
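The point-level criterion above can be sketched as follows; the vertex-index sets passed in are hypothetical, and in practice they would come from the search output and the annotations:

```python
def vertex_iou(detected, ground_truth):
    """Intersection over union of two vertex-index sets."""
    detected, ground_truth = set(detected), set(ground_truth)
    union = len(detected | ground_truth)
    return len(detected & ground_truth) / union if union else 0.0

def is_true_detection(detected, annotated_objects, threshold=0.5):
    """A detection counts as true if its point-level IoU with some
    ground-truth object exceeds the threshold."""
    return any(vertex_iou(detected, gt) > threshold
               for gt in annotated_objects)
```

Computing the IoU over vertex sets rather than bounding boxes means a detection is only credited when the segmentation itself, not just the localization, is accurate.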

The evaluation was performed on every template, and the precision, recall, and F-measure were then averaged over all evaluations. Table II shows the averaged precision, recall, and F-measure of the object search. As shown, the tool can localize and segment 70% of repetitive objects with 69% precision and a 65% F-measure. We also tested the object search without the alignment error term in (7). Experimental results show that, compared with using the shape context dissimilarity score in (6) alone, augmenting it with the alignment error slightly reduces the detection rate (by about 2%) but largely improves the precision (from 22% to 69%), leading to a significant increase of the F-measure (from 30% to 65%).
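From per-template counts of true detections, total detections, and ground-truth repetitions, the three scores can be computed as below (a minimal sketch; the balanced F1 form is assumed, since the weighting used in the experiments is not restated here):

```python
def precision_recall_f1(num_true_det, num_detections, num_ground_truth):
    """Standard detection metrics from raw counts.

    precision = true detections / all detections
    recall    = true detections / all ground-truth objects
    F1        = harmonic mean of precision and recall
    """
    precision = num_true_det / num_detections if num_detections else 0.0
    recall = num_true_det / num_ground_truth if num_ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

These per-template scores would then be averaged over all templates to produce the figures in Table II.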

Our experimental results show that the object search works efficiently with templates represented by about 200 points. For example, for the scene presented in Fig. 6, the object search completed within 15 seconds with a 150-point template on a machine equipped with an Intel(R) Core(TM) i7 2.10 GHz CPU and 32 GB of memory. In practice, the object search can run in background threads while users are performing other interactions.

Method                    Precision   Recall   F-measure
Without alignment error     0.22       0.72      0.30
With alignment error        0.69       0.70      0.65
TABLE II: Performance of the proposed object search.

9.4 Evaluation of 2D Segmentation

We also evaluated the performance of the 2D segmentation using the OCE metric. This experiment was conducted on the dorm sequence from the SUN3D dataset [3], which contains 58 images with manually crafted, publicly available ground-truth labels.

Table III reports the segmentation performance obtained by directly projecting 3D regions onto 2D images and by applying our alignment algorithm; it also quantifies the impact of the local shape, continuity, and smoothness terms. As shown in Table III, the combination of local shape, continuity, and smoothness achieves the best performance. We also visually observed that the alignment algorithm makes projected contours smoother and closer to true edges, which makes it more convenient for users to edit the 2D segmentation results.
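The baseline projection step can be sketched with a standard pinhole camera model. This is an illustrative sketch only, not the tool's implementation; the intrinsics `fx, fy, cx, cy` and the world-to-camera pose `(R, t)` are assumed inputs supplied by the dataset's camera pose data:

```python
import numpy as np

def project_labels(vertices, labels, R, t, fx, fy, cx, cy, width, height):
    """Project labeled 3D vertices into a 2D label image.

    Each vertex is transformed into the camera frame and projected with
    a pinhole model; per pixel, the nearest vertex wins via a depth
    buffer. Pixels no vertex hits keep label 0 (unlabeled).
    """
    label_img = np.zeros((height, width), dtype=np.int32)
    depth_buf = np.full((height, width), np.inf)
    cam = (R @ vertices.T).T + t          # world -> camera frame
    for (x, y, z), lab in zip(cam, labels):
        if z <= 0:                        # behind the camera
            continue
        u = int(round(fx * x / z + cx))   # pinhole projection
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height and z < depth_buf[v, u]:
            depth_buf[v, u] = z
            label_img[v, u] = lab
    return label_img
```

The alignment algorithm would then refine the region contours of such a projected label image toward the true image edges.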

Experimental results show that our alignment algorithm works efficiently: on average, the alignment completes in about 1 second per frame.

Segmentation method OCE
Projection 0.57
Local shape 0.60
Local shape + Continuity 0.55
Local shape + Smoothness 0.55
Local shape + Continuity + Smoothness 0.54
TABLE III: Comparison of different segmentation methods.

10 Conclusion

This paper presented a robust tool for the segmentation and annotation of 3D scenes. The tool couples geometric information from 3D space with color information from multi-view 2D images in an interactive framework. To enhance usability, we developed assistive interactive operations that allow users to flexibly manipulate scenes and objects in both 3D and 2D space. The tool also provides automated functionalities such as scene and image segmentation and object search for semantic annotation.

Along with the tool, we created a dataset of more than 100 scenes, all of which were annotated using our tool; the newly created dataset was also used to verify the tool. Since the overall performance of the tool depends on the quality of the 3D reconstruction, improving the quality of 3D meshes by recovering broken surfaces and missing object parts will be our future work.


Lap-Fai Yu is supported by the University of Massachusetts Boston StartUp Grant P20150000029280 and by the Joseph P. Healey Research Grant Program provided by the Office of the Vice Provost for Research and Strategic Initiatives & Dean of Graduate Studies of the University of Massachusetts Boston. This research is supported by the National Science Foundation under award number 1565978. We also acknowledge NVIDIA Corporation for graphics card donation.

Sai-Kit Yeung is supported by Singapore MOE Academic Research Fund MOE2013-T2-1-159 and SUTD-MIT International Design Center Grant IDG31300106. We acknowledge the support of the SUTD Digital Manufacturing and Design (DManD) Centre which is supported by the National Research Foundation (NRF) of Singapore. This research is also supported by the National Research Foundation, Prime Minister’s Office, Singapore under its IDM Futures Funding Initiative.

Finally, we sincerely thank Fangyu Lin for assisting with data capture and Guoxuan Zhang for the early version of the tool.


  • [1] H. Roth and M. Vona, “Moving volume kinectfusion,” in British Machine Vision Conference, 2012, pp. 1–11.
  • [2] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. W. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in RGB-D images,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
  • [3] J. Xiao, A. Owens, and A. Torralba, “SUN3D: A database of big spaces reconstructed using sfm and object labels,” in IEEE International Conference on Computer Vision, 2013, pp. 1625–1632.
  • [4] Q. Y. Zhou and V. Koltun, “Simultaneous localization and calibration: Self-calibration of consumer depth cameras,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2014, pp. 454–460.
  • [5] J. P. C. Valentin, S. Sengupta, J. Warrell, A. Shahrokni, and P. H. S. Torr, “Mesh based semantic modelling for indoor and outdoor scenes,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 2067–2074.
  • [6] C. Hane, C. Zach, A. Cohen, R. Angst, and M. Pollefeys, “Joint 3D scene reconstruction and class segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 97–104.
  • [7] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shape modeling,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [8] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert, “From 3D scene geometry to human workspace,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2011, pp. 1961–1968.
  • [9] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: A database and web-based tool for image annotation,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157–173, 2008.
  • [10] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li, “Imagenet: A large-scale hierarchical image database,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [11] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [12] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 3485–3492.
  • [13] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
  • [14] J. Deng, A. C. Berg, K. Li, and F. F. Li, “What does classifying more than 10, 000 image categories tell us?” in European Conference on Computer Vision, 2010, pp. 71–84.
  • [15] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung, “Scenenn: A scene meshes dataset with annotations,” in International Conference on 3D Vision (3DV), 2016.
  • [16] X. Ren, L. Bo, and D. Fox, “RGB-(D) scene labeling: features and algorithms,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2012, pp. 2759–2766.
  • [17] S. Gupta, P. Arbelaez, and J. Malik, “Perceptual organization and recognition of indoor scenes from RGB-D images,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 564–571.
  • [18] S. Choi, Q.-Y. Zhou, and V. Koltun, “Robust reconstruction of indoor scenes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5556–5565.
  • [19] M. Jancosek and T. Pajdla, “Multi-view reconstruction preserving weakly-supported surfaces,” in Proc IEEE International Conference on Computer Vision and Pattern Recognition, 2011, pp. 3121–3128.
  • [20] J. Valentin, V. Vineet, M. M. Cheng, D. Kim, J. Shotton, P. Kohli, M. Niessner, A. Criminisi, S. Izadi, and P. Torr, “Semanticpaint: Interactive 3D labeling and learning at your fingertips,” ACM Transactions on Graphics, vol. 34, no. 5, pp. 1–16, 2015.
  • [21] O. Miksik, V. Vineet, M. Lidegaard, R. Prasaath, M. Nießner, S. Golodetz, S. L. Hicks, P. Pérez, S. Izadi, and P. H. S. Torr, “The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces,” in ACM Conference on Human Factors in Computing Systems, 2015, pp. 3317–3326.
  • [22] Z. Jia, A. C. Gallagher, A. Saxena, and T. Chen, “3D-based reasoning with blocks, support, and stability,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 1–8.
  • [23] D. Lin, S. Fidler, and R. Urtasun, “Holistic scene understanding for 3D object detection with RGBD cameras,” in IEEE International Conference on Computer Vision, 2013, pp. 1417–1424.
  • [24] B. Kim, P. Kohli, and S. Savarese, “3D scene understanding by voxel-crf,” in IEEE International Conference on Computer Vision, 2013, pp. 1425–1432.
  • [25] Y. S. Wong, H. K. Chu, and N. J. Mitra, “Smartannotator: An interactive tool for annotating RGBD indoor images,” Computer Graphics Forum, vol. 34, no. 2, pp. 447–457, 2015.
  • [26] Y. Wang, R. Ji, and S. F. Chang, “Label propagation from imagenet to 3D point clouds,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 3135–3142.
  • [27] Y. M. Kim, N. J. Mitra, D. M. Yan, and L. Guibas, “Acquiring 3d indoor environments with variability and repetition,” ACM Transactions on Graphics, vol. 31, no. 6, pp. 138:1–138:11, 2012.
  • [28] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
  • [29] L. Nan, K. Xie, and A. Sharf, “A search-classify approach for cluttered indoor scene understanding,” ACM Transactions on Graphics, vol. 31, no. 6, pp. 137:1–137:10, 2012.
  • [30] T. Shao, W. Xu, K. Zhou, J. Wang, D. Li, and B. Guo, “An interactive approach to semantic modeling of indoor scenes with an RGBD camera,” ACM Transactions on Graphics, vol. 31, no. 6, pp. 136:1–136:11, 2012.
  • [31] K. Chen, Y. K. Lai, Y. X. Wu, R. Martin, and S. M. Hu, “Automatic semantic modeling of indoor scenes from low-quality RGB-D data using contextual information,” ACM Transactions on Graphics, vol. 33, no. 6, pp. 208:1–208:11, 2014.
  • [32] Y. Zhang, W. Xu, Y. Tong, and K. Zhou, “Online structure analysis for real-time indoor scene reconstruction,” ACM Transactions on Graphics, 2015, to appear.
  • [33] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002.
  • [34] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in IEEE ISMAR.   IEEE, October 2011.
  • [35] Q. Y. Zhou and V. Koltun, “Dense scene reconstruction with points of interest,” ACM Transactions on Graphics, vol. 32, no. 4, pp. 112:1–112:8, 2013.
  • [36] P. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
  • [37] S. A. Barker and P. J. W. Rayner, “Unsupervised image segmentation using markov random field models,” Pattern Recognition, vol. 33, no. 4, pp. 587–602, 2000.
  • [38] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
  • [39] M. Körtgen, G. J. Park, M. Novotni, and R. Klein, “3D shape matching with 3D shape contexts,” in Central European Seminar on Computer Graphics, Apr. 2003.
  • [40] R. Jonker and A. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” Computing, vol. 38, pp. 325–340, 1987.
  • [41] B. K. P. Horn, “Closed-form solution of absolute orientation using unit quaternions,” Journal of the Optical Society of America A, vol. 4, no. 4, pp. 629–642, 1987.
  • [42] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986.
  • [43] N. Narappanawar, B. M. Rao, and M. Joshi, “Graph theory based segmentation of traced boundary into open and close sub-sections,” Computer Vision and Image Understanding, vol. 115, no. 11, pp. 1552–1558, 2011.
  • [44] A. Thayananthan, B. Stenger, P. H. S. Torr, and R. Cipolla, “Shape context and chamfer matching in cluttered scenes,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2003, pp. 127–133.
  • [45] M. Polak, H. Zhang, and M. Pi, “An evaluation metric for image segmentation of multiple objects,” Image and Vision Computing, vol. 27, pp. 1123–1127, 2009.
  • [46] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating algorithms and measuring ecological statistics,” in Proc International Conference on Computer Vision, 2001, pp. 416–423.