Rescan: Inductive Instance Segmentation for Indoor RGBD Scans

09/25/2019 · by Maciej Halber, et al.

In depth-sensing applications ranging from home robotics to AR/VR, it will be common to acquire 3D scans of interior spaces repeatedly at sparse time intervals (e.g., as part of regular daily use). We propose an algorithm that analyzes these "rescans" to infer a temporal model of a scene with semantic instance information. Our algorithm operates inductively by using the temporal model resulting from past observations to infer an instance segmentation of a new scan, which is then used to update the temporal model. The model contains object instance associations across time and thus can be used to track individual objects, even though there are only sparse observations. In experiments on a new benchmark for this task, our algorithm outperforms alternative approaches based on state-of-the-art networks for semantic instance segmentation.




1 Introduction

With the proliferation of RGBD cameras, 3D data is now more widely available than ever before [10, 25, 8]. As depth capturing devices become smaller and more affordable, and as they operate in everyday applications (AR/VR, home robotics, autonomous navigation, etc.), it is plausible to expect that 3D scans of most environments will be acquired on a daily basis. We can expect that 3D reconstructions of many spaces, visited at different times and captured from different viewpoints, will be available in the future, just like photographs are today.

In this paper, we investigate how repeated, infrequent scans captured with handheld RGBD cameras can be used to build a spatio-temporal model of an interior environment, complete with object instance semantics and associations across time. The challenges are that: 1) each RGBD scan captures the environment from different viewpoints, possibly with noisy data; and 2) scans separated by long time intervals (once per day, every Tuesday, etc.) can have large differences due to object motion, entry, or removal. Thus simple algorithms that perform object detection individually for each scan and/or simply cluster object detections and poses in space-time will not solve the problem. Moreover, since large training sets are not available for this task, it is not practical to train a neural network to solve it.

We propose an inductive algorithm that infers information about a new RGBD capture of a scene from a temporal model obtained from previous observations of that scene (fig. 1). The input to the algorithm is the temporal model, representing all previous scans, and a novel scene scan. The output is an updated model that describes the set of objects appearing in the scene and an arrangement of those objects at each time step, including the most recent. At every iteration, our algorithm optimizes for the arrangement of objects in the new scan, and then uses that arrangement to infer the semantic instance segmentation of the scan. The segmentation is in turn used to update the object set (see fig. 2).

To evaluate our algorithm we present a novel benchmark dataset that contains temporally consistent ground-truth semantic instance labels, describing object associations across time within each scene. Experiments with this benchmark suggest that our proposed optimization strategy is superior to alternative approaches based on deep learning for semantic and instance segmentation tasks.

Overall, the contributions of the paper are three-fold:

  • A system for building a spatio-temporal model for an indoor environment from infrequent scans acquired with hand-held RGBD cameras,

  • An inductive algorithm that jointly infers the shapes, placements, and associations of objects from infrequent RGBD scans by utilizing data from past scans,

  • A benchmark dataset with rescans of 13 scenes acquired at 45 time-steps in total, along with ground-truth annotations for object instances and associations across time.

2 Related Work

Most work in computer vision on RGBD scanning of dynamic scenes has focused on tracking [43] and reconstruction [36]. For example, Newcombe et al. [36] showcase a system where multiple observations of a deforming object are fused into a single consistent reconstruction. Yan et al. [48] scan moving articulated shapes by tracking parts as they deform over time. These methods differ from ours in that they require observation of motions as they occur.

For sparse temporal observations, early work in robotics focuses on the analysis of 2D maps created from 1D laser range sensors [3, 5, 19]. For example, Biswas [5] used 1D laser data to detect objects within a scene and associate them across time. However, their method relies upon 2D algorithms and assumes that object instances cannot overlap across time, which makes it inapplicable in our setting. More recently, image-based techniques for sparse observations have been proposed; Shin [42] extends SfM to also predict poses of moving objects.

Other work has aimed at life-long scene understanding using data captured with actively controlled sensors [15, 29, 39, 49]. For example, several algorithms proposed in the STRANDS project [23] process scenes observed from a repeated set of views [1, 6, 41]. Others focus on controlling camera trajectories to acquire the best views for object modeling [13, 15] and/or change detection [2]. These problems are different from ours, as we focus on analyzing previously acquired RGBD data captured without a specifically tailored robotic platform and active control.

Some work in computer vision has focused on change detection and segmentation of dynamic objects in RGBD scans [16, 31, 46]. For example, Fehr et al. [16] showcase a system that uses multiple scene observations to classify surface elements as dynamic or static. Wang et al. [47] detect moving objects so that they can be removed from a SLAM optimization. Lee et al. [31] propose a probabilistic model to isolate temporally varying surface patches to improve camera localization. While operating on RGBD captures from handheld devices, these methods do not produce instance-level semantic segmentations, nor do they generate associations between objects across time.

More recent work has focused on automatic clustering of 3D points into clusters across space and time [17, 24]. For example, Herbst et al. [24] jointly segment multiple RGBD scans with a joint MRF formulation. Finman et al. [17] detect clusters of points from pairwise scene differencing and associate new detections with previous observations. Although similar in spirit to our formulation, these methods operate only on clusters of points, without semantics, and thus are not suited for applications that require semantic understanding of how objects move across space-time.

Finally, many projects have considered temporal modeling of environments in specific application domains. For example, several systems in civil engineering track changes to a Building Information Model (BIM) by alignment to 3D scans acquired at sparse temporal intervals [20, 26, 37, 45]. They generally start with a specific building design model [22], construction schedule [44], and/or object-level CAD models [7], and thus are not as general as our approach. The Scene Chronology project [35] and others [34, 40] build temporal models of cities from image collections – however, they do not recover a full 3D model with temporal associations of object instances as we do.

3 Algorithm

3.1 Scene Representation

Our system represents a scene at time t with a temporal model M_t comprising a tuple (O, A), where O is a list of object instances that have appeared in this or any prior observation, and A is a list of object arrangements, one estimated for each observation. Each object instance o_i in O is represented by a tuple (id_i, g_i, c_i), where id_i is a unique instance id, g_i is the object's geometry, and c_i is the semantic class. Each arrangement a_t in A is a list of poses p_j = (id_j, T_j, s_j). Here id_j is the unique id of the j-th object (an index function maps id_j back to the corresponding entry in O), T_j is a rigid transformation that moves the geometry g_j into its correct location within the scene S_t, and s_j is a matching score quantifying how well the transformed geometry T_j(g_j) matches the geometry of S_t.
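As a concrete illustration, the temporal model described above might be organized as the following data structures. This is a hypothetical sketch; the names and types are ours, not the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical sketch of the temporal model; names and types are ours.

@dataclass
class ObjectInstance:
    instance_id: int        # unique id, stable across all observations
    geometry: np.ndarray    # (N, 3) aggregated point cloud, canonical frame
    semantic_class: str     # e.g. "chair"

@dataclass
class Pose:
    instance_id: int        # refers back to an ObjectInstance
    transform: np.ndarray   # 4x4 rigid-body transformation into the scan
    score: float            # geometric matching score for this placement

@dataclass
class TemporalModel:
    objects: list           # ObjectInstance: everything seen so far
    arrangements: list      # one list of Poses per observation, oldest first
```

Keeping poses separate from object geometry is what lets the same instance appear in many arrangements while its geometry is refined over time.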

3.2 Overview

Our algorithm updates the temporal model in an inductive fashion. Given the previous model and a new scan, we predict a new model (see fig. 2) by executing four consecutive steps. The first proposes potential poses for each known object in the new scan (sec. 3.3). The second performs a combinatorial optimization to find the arrangement that maximizes a new objective function jointly accounting for geometric fit and temporal coherence (sec. 3.4). The third step uses the optimized arrangement to infer an instance-level semantic segmentation of the scan. The fourth step updates the geometry of each object by aggregating its respective segment from the scan. The following four subsections detail how each of these steps is implemented.

3.3 Object Pose Proposal

The first step of our pipeline is to find a set of potential placements for each object, creating a search space for the arrangement optimization stage (sec. 3.4). Formally, the input to this stage is a set of objects and a scan. The output is a scored pose list for each object. A scored pose is a tuple (T, s), where T is the proposed rigid-body transformation and s is a geometric matching score describing how well the posed object aligns with the geometry of the scan.

Finding transformations that align two surfaces is a longstanding problem in computer graphics and vision [38]. In our setting, we wish to find a set of poses for each object's surface that align well with the surface of the new scan. Prior work usually attempts to solve similar problems with feature-based methods. Such methods sub-sample the two surfaces to obtain a set of meaningful keypoints and then match them to produce a plausible pose (e.g., using Point-Pair Feature matching [12]). However, as has been noted in other domains, keypoints may limit the amount of information a method considers, and dense matching methods lead to fewer failures [14].

Following this intuition, we propose a dense matching procedure, where we slide each of the objects across the scene, perform an ICP optimization at each of the discrete locations and compute a matching score based on the traditional point-to-plane distance metric [33].

This approach might seem counter-intuitive, as a naive implementation of such a grid search would lead to prohibitive run-time performance. We find, however, that it can be made acceptably fast while recovering correct poses far more reliably. To speed up our method we use a multi-resolution approach: we compute a four-level hierarchy for each input point cloud, with a progressively smaller minimum distance between points at each finer level, following the sampling algorithm described in [9]. This representation allows us to perform the dense search only on the coarsest level of the hierarchy and to verify only the subset of poses with sufficiently high scores at finer levels, leading to significant performance gains. Additionally, we make the simplifying but reasonable assumption that objects in our scenes move on the ground plane and rotate around the gravity direction.
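A minimal sketch of this coarse dense search, under our own simplifying assumptions (a plain nearest-neighbor score stands in for the paper's point-to-plane ICP score, and a single resolution level stands in for the coarsest hierarchy level):

```python
import numpy as np
from scipy.spatial import cKDTree

def propose_poses(obj_pts, scene_pts, step=0.5, yaw_steps=8, keep=5):
    """Slide the object over the ground plane, rotate it about gravity (z),
    and score each placement by mean nearest-neighbor distance to the scene.
    Lower scores are better; the best `keep` candidates would then be
    verified at finer hierarchy levels (omitted here)."""
    tree = cKDTree(scene_pts)
    lo, hi = scene_pts.min(axis=0), scene_pts.max(axis=0)
    candidates = []
    for yaw in np.linspace(0.0, 2.0 * np.pi, yaw_steps, endpoint=False):
        c, s = np.cos(yaw), np.sin(yaw)
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        rotated = obj_pts @ R.T
        for x in np.arange(lo[0], hi[0] + 1e-9, step):
            for y in np.arange(lo[1], hi[1] + 1e-9, step):
                t = np.array([x, y, 0.0])
                d, _ = tree.query(rotated + t)       # NN distance per point
                candidates.append((float(d.mean()), float(yaw), t))
    candidates.sort(key=lambda cand: cand[0])
    return candidates[:keep]
```

The grid step and the number of yaw samples are the knobs that trade pose recall against run time, which is exactly why the multi-resolution hierarchy matters in practice.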

With this approach we produce a scored pose list for each object. The advantage of this dense grid-search method is that the resulting pose sets contain most of the true candidate locations, even when the local geometry of the scan differs from the stored object geometry due to reconstruction errors. We compare against keypoint-based methods [12, 4] in figure 3.

Figure 3: Comparison of the precision/recall scores obtained for all scenes in our database, comparing PPF matching [4] to our method. In our experiments, a pose of an object is considered a true positive if the distance between object centers is below a fixed threshold and the objects' classes agree.

3.4 Arrangement Optimization

In the second step our algorithm selects a subset of poses from the previous step to form an object arrangement. The input is the set of objects, a scored pose list for each object, and the scan. The output is an arrangement describing a global configuration of objects that maximizes our objective.

This problem statement leads to a discrete, combinatorial optimization. The first reason for choosing this approach is that the number of objects within the scene is not known a priori; a combinatorial formulation lets us propose arrangements of variable length that adapt to the contents of the scan. A second reason is that finding the optimum requires global optimization: the placement of one object can greatly affect the placement of another. Additionally, deep learning is hard to apply here due to the lack of training data and the non-linearity of the proposed objective function.

3.4.1 Objective Function

To quantify the quality of a candidate arrangement A, we use an objective function that is a linear combination of four terms:

E(A) = w_cov · Coverage(A) + w_geo · Geometry(A) + w_int · Intersection(A) + w_hys · Hysteresis(A)

Each term produces a scalar value that describes the quality of A with respect to that specific aspect. We use grid search to find good values for the weights w, which express the relative importance of each term.

The Coverage term measures the percentage of the scene that is covered by objects in the candidate arrangement. The intuition behind this term is that every part of the scene should ideally be explained by some object. To compute it, we voxelize both the scene and the arranged objects, resulting in two 3D occupancy grids; the score is the number of cells occupied in both grids divided by the number of occupied cells in the scene grid. For this formula to be accurate, however, we must ensure that only the dynamic parts of the scene are voxelized. We therefore deactivate any cells that belong to static parts of the scene, like walls and floor, which can easily be detected with a method like RANSAC [18]. The inset figure showcases a visualization of both grids (blue and white cells); as seen there, the voxelization covers only the non-static parts of the scene, making this ratio a good estimate of coverage.
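Under the stated idealization (occupancy on a shared voxel grid, static surfaces already removed), the coverage ratio might be computed as follows; this is our sketch, not the authors' code, and the voxel size is an assumption:

```python
import numpy as np

def coverage(scene_pts, object_pts, voxel=0.1):
    """Fraction of occupied scene voxels that are also occupied by an
    arranged object. scene_pts is assumed to contain only the dynamic part
    of the scene (walls/floor already removed by RANSAC or similar)."""
    def occupied(pts):
        # map each point to its integer voxel index
        return set(map(tuple, np.floor(np.asarray(pts) / voxel).astype(int)))
    scene_v, obj_v = occupied(scene_pts), occupied(object_pts)
    return len(scene_v & obj_v) / len(scene_v) if scene_v else 0.0
```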

The Geometry term measures the geometric agreement between the scene and the objects in the candidate arrangement. We include this term to guide the objective toward objects that best match the geometry of the scene at a specific location. Its value is simply the average, over objects in the arrangement, of the matching scores produced by the pose proposal procedure described in section 3.3.

The Intersection term estimates how much pairs of objects in the arrangement interpenetrate. Intuitively, interpenetration would mean that two objects occupy the same physical location, an impossible configuration. We compute a coarse approximation of this term. First, we compute the centroid and covariance matrix of each posed object. The covariances allow us to compute a symmetric Mahalanobis distance for each pair of objects, evaluated at the midpoint between their transformed centroids, to approximately quantify how close the two objects are. With this quantity computed for all pairs of objects, the term takes the value of the worst pair, i.e., an infinity norm over pairs. The rationale behind the infinity norm is to generate a high penalty if even a single pair of objects interpenetrates. The inset figure showcases a visualization of this penalty for two intersecting objects: the evaluation point is marked in red, with high values in regions where either or both objects are present and low values in free space. The penalty would be higher still if the objects interpenetrated more.
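The pairwise closeness measure might be sketched as below: we fit a centroid and covariance to each posed object and evaluate the symmetric Mahalanobis distance at the centroid midpoint, where a low distance signals likely interpenetration. Names and the averaging convention are our assumptions:

```python
import numpy as np

def pair_closeness(pts_a, pts_b):
    """Symmetric Mahalanobis distance between two posed objects, evaluated
    at the midpoint of their centroids. A LOW value means the midpoint lies
    inside both point distributions, i.e. the objects likely interpenetrate;
    the intersection term would then penalize the arrangement accordingly."""
    mu_a, mu_b = pts_a.mean(axis=0), pts_b.mean(axis=0)
    cov_a, cov_b = np.cov(pts_a.T), np.cov(pts_b.T)
    mid = 0.5 * (mu_a + mu_b)

    def mahal(x, mu, cov):
        d = x - mu
        return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

    # symmetrize by averaging the distance under each object's Gaussian
    return 0.5 * (mahal(mid, mu_a, cov_a) + mahal(mid, mu_b, cov_b))
```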

The Hysteresis term informs how well the current arrangement estimate resembles previously observed arrangements; in addition, it expresses our preference for minimal relative motion. Each object in the arrangement is assigned a score based on whether it is a novel instance or has been observed in the past. In the former case, we assign a manually chosen constant score. In the latter, the score decreases with the distance the object's centroid has moved since the previous arrangement. As a result, previously observed objects are preferred over novel ones unless they have undergone a significant transformation, in which case re-using an old object and introducing a novel one have similar scores. The value of the term is computed as the average of these per-object scores. The inset figure illustrates an arrangement at one time step and two possible arrangement estimates at the next; the hysteresis term encourages the selection of the middle arrangement, as it does not contain significant motion of the sofa and chairs.
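One plausible per-object score with these properties (constant for novel objects, decaying with centroid motion for re-observed ones) is sketched below; the Gaussian falloff and the constants are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def hysteresis_score(centroid_now, centroid_prev, novel_score=0.5, sigma=0.5):
    """Per-object hysteresis score (our sketch). Re-observed objects are
    scored by how little their centroid moved since the last arrangement;
    novel objects (centroid_prev is None) receive a fixed constant score,
    so a re-observed object is preferred unless it has moved far."""
    if centroid_prev is None:
        return novel_score
    d = np.linalg.norm(np.asarray(centroid_now) - np.asarray(centroid_prev))
    return float(np.exp(-(d / sigma) ** 2))
```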

3.4.2 Optimization

To find the arrangement, we employ a combination of greedy initialization and simulated annealing. We begin by greedily adding the object and pose that improve the objective the most, and continue until the objective function starts decreasing. After this stage, we run simulated annealing for 25k iterations, using a linear cooling schedule with random restarts (a small fixed probability of returning to the best-scoring state). To explore the search space we use the following actions, each applied to a randomly selected object:

  • Add Object - We add the object to the arrangement at a random pose from its proposal list.

  • Remove Object - We remove the object from the arrangement.

  • Move Object - We assign the object a new pose from its proposal list.

  • Swap Objects - We swap the locations of the object and another randomly selected object of the same semantic class.
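The optimization loop described above might be sketched as follows; the action set and objective are passed in, and the restart probability and schedule constants are our assumptions:

```python
import math
import random

def anneal(initial, objective, actions, iters=25000, restart_p=0.02):
    """Simulated annealing with a linear cooling schedule and random
    restarts to the best state seen so far (our sketch of the described
    loop; greedy initialization is assumed to have produced `initial`).
    `actions` is a list of functions mapping a state to a proposal,
    e.g. the Add/Remove/Move/Swap moves from the text."""
    state, score = initial, objective(initial)
    best_state, best_score = state, score
    for i in range(iters):
        temp = max(1e-6, 1.0 - i / iters)       # linear cooling
        if random.random() < restart_p:         # random restart
            state, score = best_state, best_score
            continue
        cand = random.choice(actions)(state)
        cand_score = objective(cand)
        # accept improvements always; worse moves with Boltzmann probability
        if cand_score > score or random.random() < math.exp((cand_score - score) / temp):
            state, score = cand, cand_score
            if score > best_score:
                best_state, best_score = state, score
    return best_state, best_score
```

For arrangement search, the state would be a list of (object, pose) pairs and the objective the weighted sum of the four terms above.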

3.5 Segmentation Transfer

The third step of the algorithm transfers semantic and instance labels from the temporal model to the new scan. The arrangement estimated in the previous step can be used directly for this transfer, as a semantic class and instance id are associated with each object. Using its estimated pose, we transform each object's geometry to align with the scan. We then perform a nearest-neighbor lookup (with a maximum distance threshold to account for outliers) and use the resulting associations to copy both instance and semantic labels from the objects to the scan points. Since there is no guarantee that every scan point has a neighbor within the threshold, we follow the lookup with label smoothing based on a multi-label graph cut [11].
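A sketch of the nearest-neighbor transfer step (without the graph-cut smoothing), with hypothetical names and threshold:

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_labels(obj_pts, obj_labels, scan_pts, max_dist=0.1):
    """Copy the instance label of the nearest posed-object point to each
    scan point; points with no neighbor within max_dist stay unlabeled (-1).
    The paper follows this lookup with multi-label graph-cut smoothing,
    omitted here."""
    tree = cKDTree(obj_pts)
    d, idx = tree.query(scan_pts, distance_upper_bound=max_dist)
    labels = np.full(len(scan_pts), -1, dtype=int)
    hit = np.isfinite(d)                 # query returns inf for misses
    labels[hit] = np.asarray(obj_labels)[idx[hit]]
    return labels
```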

3.6 Geometry Fusion

The final step of the algorithm updates the stored object geometries. For each object, we extract the sub point cloud of the scan that was assigned its instance label in the previous step and concatenate it with the object's existing geometry. In the idealized case the two surfaces would be identical, as they represent the same object; in practice, partial observation, reconstruction, and alignment errors prevent this. We therefore solve for a mean surface that minimizes the distance to all points in the concatenated cloud using Poisson Surface Reconstruction [27]. Afterwards, we uniformly sample points on the resulting surface to obtain a new geometry estimate that is used for matching when the next scan arrives.

4 Evaluation

Figure 4: Inductive instance segmentation results. Given a segmentation at the initial time step, our method is able to iteratively transfer instance labels to future times, even when the number of objects in the scene changes.

Evaluation of the proposed algorithm is not straightforward, as there is little to no prior work directly addressing instance segmentation transfer between 3D scans.

Dataset: To evaluate the proposed approach, we have created a dataset of temporally varying scenes. Our dataset contains 13 distinct scenes, with a total of 45 separate reconstructions. Each scene contains between 3 and 5 scans, where objects within each capture were moved to simulate changes occurring across long time periods. Along with the captured data, we also provide manually-curated semantic category and instance labels for every object in every scene. The instance labels are stable across time, providing associations between object instances in different scans, which we use to evaluate our algorithms. Additionally, we provide permutations of instance assignments for each scene to account for cases where objects' motion is ambiguous and multiple arrangements can be considered correct. More details about the dataset are included in the supplemental material.

Metrics: We evaluate our approach using three metrics. The first is the Semantic Label metric that measures the correctness of class labels – it is implemented in the same way as the semantic segmentation task in the ScanNet Benchmark [10] and is reported as mean class IoU. The second is the Semantic Instance metric that measures the correctness of the object instance separations – it again comes from the ScanNet Benchmark [10] and is reported as mean Average Precision (IoU=0.5). Third, we propose a novel Instance Transfer metric, which specifically requires an agreement of instance indices across time. This metric is reported as mean IoU, where we count the number of points in both ground truth and prediction that share equivalent instance id. The Instance Transfer metric is much more challenging, as it requires associating objects with specific instance ids in different scans.
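For concreteness, the Instance Transfer metric as described might be computed as follows (our sketch of the described computation, not the benchmark's reference code):

```python
import numpy as np

def instance_transfer_iou(gt, pred):
    """For every instance id in the ground truth, compute the IoU between
    the point sets carrying that id in ground truth and prediction, then
    average over ids. Unlike plain instance segmentation, the ids
    themselves must agree across scans for a point to count."""
    ious = []
    for inst in np.unique(gt):
        g, p = gt == inst, pred == inst
        union = np.logical_or(g, p).sum()
        inter = np.logical_and(g, p).sum()
        ious.append(inter / union if union else 0.0)
    return float(np.mean(ious))
```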

Baseline: Given the success of recent deep models for scene understanding (as shown on the leaderboard of [10]), it is interesting to compare our algorithm to the best available methods based on deep neural networks. One of the best available methods for 3D instance segmentation is MASC [32], which is based on semantic segmentation with SparseConvNet [21]. To test these methods on our tasks, we pre-trained the SparseConvNet and MASC models on ScanNet's training set. We then fine-tuned MASC with the ground-truth labels of the first observation of each scene in our database. This fine-tuned model provides instance segmentations, which can be combined with the Hungarian method [30] to estimate instance associations across time. This sequence of steps provides a very strong baseline, combining state-of-the-art methods for instance segmentation with an established algorithm for assignment.
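The association step of such a baseline can be sketched with SciPy's Hungarian solver; the cost matrix (e.g., one minus pairwise IoU between instances at consecutive time steps) is assumed given:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(cost):
    """Minimum-cost one-to-one matching between instances at time t-1
    (rows) and time t (columns) via the Hungarian method."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```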

4.1 Quantitative Results

Method               Semantic Label   Semantic Instance   Instance Transfer
SparseConvNet        0.203            -                   -
MASC                 0.310            0.291               0.175
MASC (fine-tuned)    0.737            0.562               0.345
Rescan (ours)        0.859            0.837               0.650

Table 1: Comparison of our method to SparseConvNet [21] and MASC [32]. SparseConvNet does not produce instance labels, hence we omit it from the Semantic Instance and Instance Transfer tasks, and only fine-tune MASC.
Figure 5: Qualitative comparison on the semantic segmentation task. The proposed method provides high quality semantic labels as a result of instance segmentation transfer. Compared to competing methods, ours produces better per-object labels and does not confuse object classes.

Evaluation and comparison: Since we solve an inductive task (predict the answer at one time step given an answer at the previous one), it is not obvious how to initialize the system for our experiments. As our aim is to evaluate the inductive step alone, we chose to initialize the first time step with a correct instance segmentation. That choice avoids confounding problems of de novo instance segmentation with the main objective of the experiment. We have each algorithm in the experiment transfer the instance segmentation from the first observation to the second, then transfer the result to the third, and so on.

We ran this experiment for our method in direct comparison to the baseline. Results for all three evaluation metrics are shown in Table 1. They show that our algorithm significantly outperforms competing methods. As expected, the deep neural networks trained on the ScanNet training set [10] do not perform very well on our data without fine-tuning; after fine-tuning on the first observation of each scene, they do much better. Fine-tuning allows for a fair comparison, as both their methods and ours have access to the same initial information when predicting labels for subsequent scans. Despite this, their instance segmentation on later time steps still performs worse than our algorithm's, and their instance associations across time are poor. We attribute the difference to the fact that our method is instance-centric: the segmentation is inferred from the estimated arrangement of objects. This is in stark opposition to methods like MASC, where instances are inferred from a semantic segmentation.

Ablation studies: Second, we present ablation studies that showcase the influence of the various terms in our objective function on each task. As seen in table 2, by far the most important term is the Coverage Term. Without it, the objective function is discouraged from adding more objects: the optimization simply finishes with a single object added to the scene, as adding any more would decrease the other terms.

The second most important term, especially for the Instance Transfer task, is the Hysteresis Term. Intuitively, without this term the objective function is not encouraged to find an arrangement consistent with previous object configurations. We note that when omitting this term, the semantic segmentation task achieves a slightly better result. The reason is that, to prevent the addition of superfluous objects, novel objects are assigned a relatively low score (sec. 3.4.1). Without the Hysteresis Term, the objective is free to insert additional objects; however, their configuration is often incorrect, leading to lower scores on the other two tasks. This result suggests that a better formulation of the hysteresis function exists, an interesting direction for future research.

The presence of the Intersection Term is important for the Semantic Instance and Instance Transfer tasks. Intuitively, the semantic segmentation score is unaffected as it is often the case that intersecting objects share the semantic class. The Geometry Term has the least influence on the results. This is not surprising, as the poses that survived the pose proposal stage (see sec. 3.3) were high scoring ones.

4.2 Qualitative Results

Method                 Semantic Label   Semantic Instance   Instance Transfer
No Coverage Term       0.061            0.058               0.048
No Geometry Term       0.853            0.825               0.617
No Intersection Term   0.859            0.781               0.584
No Hysteresis Term     0.870            0.818               0.226
Full Method            0.859            0.837               0.650

Table 2: Ablation study showcasing the influence of objective function terms on each of the proposed tasks.

Inductive segmentation transfer: We showcase qualitative results for the Instance Transfer task using our method in figure 4. Again, in this task we use the ground-truth segmentation provided by the user at the first time step and transfer it to all other observations sequentially. The results of this segmentation transfer offer stable and well-localized instances. Even over multiple time steps, our method keeps track of object identities, providing information on their location and motion. Additionally, because the objective function prefers minimal change, we are able to deal with challenging configurations: in figure 4a, for example, our method correctly recovers three coffee tables at a later time step, despite their proximity and visual similarity.

Semantic segmentation: Figure 5 showcases qualitative comparisons between our method and DNN-based methods [32, 21]. Without fine-tuning, the segmentation issues are obvious: learned methods confuse labels like sofa and chair, which explains the low scores in table 1. Fine-tuning helps reduce these effects, though we also see some overfitting errors. Our method recovers high quality semantic segmentations; because our approach is instance-centric, a single instance cannot carry more than one semantic class. Our method's success is, however, dependent on the overlap between current and previous observations of the scene. When many novel objects appear, the Hysteresis Term might discourage adding all of them, as it aims to produce arrangements similar to previously observed ones (fig. 5a).

Figure 6: Model completion results. The left column shows two scans of a scene with moving objects. The right column shows our reconstruction of the scene using objects and locations from the temporal model .

Model completion results: Our method for aggregating observations of moving objects from multiple time steps allows it to produce more complete surface reconstructions than would otherwise be possible. Many other systems explicitly remove moving objects before creating a surface model (to avoid ghosting) [28]. Our approach instead uses the estimated object segmentations and transformations to aggregate the points associated with each object, forming a geometry that is generally more complete than could be obtained from any one scan. Composing these aggregated geometries using the transformations in each object arrangement provides a model completion result (fig. 6).

Figure 7: Failure modes of the proposed method. (a) Partial scanning prevents the pose proposal stage from generating plausible poses. (b) Small objects contribute little to the coverage term. If such objects undergo significant motion our algorithm might miss them. (c) When similar, partially scanned objects are considered, our method might not produce the correct permutation.

Failures: We identify three main failure modes of our approach (fig. 7). The first arises from the geometry-focused nature of our approach: if an object is only partially scanned, the pose proposal stage cannot recover highly scored poses for it, so the object is simply never added to the space of configurations the optimization can choose from. The second is caused by the limited contribution of small objects to the scene coverage score; combined with a small Hysteresis Term value under significant motion, the objective function might prefer not to add these objects. Lastly, in cases like the one in figure 7c, an incorrect permutation of objects might have a higher objective value than the ground-truth one. This effect combines the Geometry Term providing noisy scores for partial scans of visually similar objects (like the chairs around the table) with their relative spatial proximity, which makes the Hysteresis Term a poor discriminator.

5 Conclusion

This paper presents an algorithm for estimating a semantic instance segmentation of an RGBD scan of an indoor environment. The proposed algorithm is inductive: using a temporal scene model that subsumes previous observations, an instance segmentation of the novel observation is inferred and then used to update the temporal model. Our experiments show better performance on a novel benchmark dataset in comparison to a strong baseline. Interesting directions for future work include inferring the segmentation at the first time step from scratch, investigating RNN architectures (when larger datasets become available), and replacing terms of the objective function with learned alternatives.

Acknowledgments

We would like to thank Angel X. Chang and Manolis Savva for insightful discussions. We also thank Graham et al. [21] and Liu et al. [32] for the comparison codes, and Dai et al. for the ScanNet data [10]. The project was partially supported by funding from the NSF (CRI 1729971 and VEC 1539014/1539099).

References

  • [1] R. Ambruş, N. Bore, J. Folkesson, and P. Jensfelt (2014) Meta-rooms: building and maintaining long term spatial models in a dynamic world. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1854–1861. Cited by: §2.
  • [2] R. Ambrus, J. Ekekrantz, J. Folkesson, and P. Jensfelt (2015) Unsupervised learning of spatial-temporal models of objects in a long-term autonomy scenario. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5678–5685. Cited by: §2.
  • [3] D. Anguelov, R. Biswas, D. Koller, B. Limketkai, and S. Thrun (2002) Learning hierarchical object maps of non-stationary environments with mobile robots. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 10–17. Cited by: §2.
  • [4] T. Birdal and S. Ilic (2015) Point pair features based object detection and pose estimation revisited. In 2015 International Conference on 3D Vision, pp. 527–535. Cited by: Figure 3, §3.3.
  • [5] R. Biswas, B. Limketkai, S. Sanner, and S. Thrun (2002) Towards object mapping in non-stationary environments with mobile robots. In 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 1, pp. 1014–1019. Cited by: §2.
  • [6] N. Bore, J. Ekekrantz, P. Jensfelt, and J. Folkesson (2017) Detection and tracking of general movable objects in large 3D maps. arXiv preprint arXiv:1712.08409. Cited by: §2.
  • [7] F. Bosche, C. T. Haas, and B. Akinci (2009) Automated recognition of 3D CAD objects in site laser scans for project 3D status visualization and performance control. Journal of Computing in Civil Engineering 23 (6), pp. 311–318. Cited by: §2.
  • [8] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §1.
  • [9] M. Corsini, P. Cignoni, and R. Scopigno (2012) Efficient and flexible sampling with blue noise properties of triangular meshes. IEEE Transactions on Visualization and Computer Graphics 18 (6), pp. 914–924. Cited by: §3.3.
  • [9] M. Corsini, P. Cignoni, and R. Scopigno (2012-06) Efficient and flexible sampling with blue noise properties of triangular meshes. IEEE Transactions on Visualization and Computer Graphics 18 (6), pp. 914–924. External Links: Document, ISSN 1077-2626 Cited by: §3.3.
  • [10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In

    Proc. Computer Vision and Pattern Recognition (CVPR), IEEE

    ,
    Cited by: §1, §4.1, §4, §4, Acknowledgments.
  • [11] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov (2012-01-01) Fast approximate energy minimization with label costs. International Journal of Computer Vision 96 (1), pp. 1–27. External Links: ISSN 1573-1405, Document, Link Cited by: §3.5.
  • [12] B. Drost, M. Ulrich, N. Navab, and S. Ilic (2010-06) Model globally, match locally: efficient and robust 3d object recognition. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. , pp. 998–1005. External Links: Document, ISSN 1063-6919 Cited by: §3.3, §3.3.
  • [13] J. Ekekrantz, N. Bore, R. Ambrus, J. Folkesson, and P. Jensfelt (2016) Towards an adaptive system for lifelong object modelling. ICRA Workshop: AI for Long-term Autonomy. Cited by: §2.
  • [14] J. Engel, T. Schöps, and D. Cremers (2014-09) LSD-SLAM: large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), Cited by: §3.3.
  • [15] T. Fäulhammer, R. Ambruş, C. Burbridge, M. Zillich, J. Folkesson, N. Hawes, P. Jensfelt, and M. Vincze (2017) Autonomous learning of object models on a mobile robot. IEEE Robotics and Automation Letters 2 (1), pp. 26–33. Cited by: §2.
  • [16] M. Fehr, F. Furrer, I. Dryanovski, J. Sturm, I. Gilitschenski, R. Siegwart, and C. Cadena (2017) TSDF-based change detection for consistent long-term dense reconstruction and dynamic object discovery. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 5237–5244. Cited by: §2.
  • [17] R. Finman, T. Whelan, L. Paull, and J. J. Leonard (2014) Physical words for place recognition in dense rgb-d maps. In ICRA workshop on visual place recognition in changing environments, Cited by: §2.
  • [18] M. A. Fischler and R. C. Bolles (1981-06) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24 (6), pp. 381–395. External Links: ISSN 0001-0782, Link, Document Cited by: §3.4.1.
  • [19] G. Gallagher, S. S. Srinivasa, J. A. Bagnell, and D. Ferguson (2009) GATMO: a generalized approach to tracking movable objects. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pp. 2043–2048. Cited by: §2.
  • [20] M. Golparvar-Fard, F. Pena-Mora, and S. Savarese (2012) Automated progress monitoring using unordered daily construction photographs and ifc-based building information models. Journal of Computing in Civil Engineering 29 (1), pp. 04014025. Cited by: §2.
  • [21] B. Graham, M. Engelcke, and L. van der Maaten (2018) 3D semantic segmentation with submanifold sparse convolutional networks. CVPR. Cited by: §4.2, Table 1, §4, Acknowledgments.
  • [22] K. Han and M. Golparvar-Fard (2015-06) BIM-assisted structure-from-motion for analyzing and visualizing construction progress deviations through daily site images and bim. pp. 596–603. External Links: Document Cited by: §2.
  • [23] N. Hawes, C. Burbridge, F. Jovan, L. Kunze, B. Lacerda, L. Mudrova, J. Young, J. Wyatt, D. Hebesberger, T. Kortner, R. Ambrus, N. Bore, J. Folkesson, P. Jensfelt, L. Beyer, A. Hermans, B. Leibe, A. Aldoma, T. Faulhammer, M. Zillich, M. Vincze, E. Chinellato, M. Al-Omari, P. Duckworth, Y. Gatsoulis, D. C. Hogg, A. G. Cohn, C. Dondrup, J. Pulido Fentanes, T. Krajnik, J. M. Santos, T. Duckett, and M. Hanheide (2017-Sep.) The strands project: long-term autonomy in everyday environments. IEEE Robotics Automation Magazine 24 (3), pp. 146–156. External Links: Document, ISSN 1070-9932 Cited by: §2.
  • [24] E. Herbst, P. Henry, and D. Fox (2014) Toward online 3-d object segmentation and mapping. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 3193–3200. Cited by: §2.
  • [25] B. Hua, Q. Pham, D. T. Nguyen, M. Tran, L. Yu, and S. Yeung (2016) SceneNN: a scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), Cited by: §1.
  • [26] K. Karsch, M. Golparvar-Fard, and D. Forsyth (2014) ConstructAide: analyzing and visualizing construction sites through photographs and building models. ACM Transactions on Graphics (TOG) 33 (6), pp. 176. Cited by: §2.
  • [27] M. Kazhdan, M. Bolitho, and H. Hoppe (2006) Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, SGP ’06, Aire-la-Ville, Switzerland, Switzerland, pp. 61–70. External Links: ISBN 3-905673-36-3, Link Cited by: §3.6.
  • [28] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb (2013-06) Real-time 3d reconstruction in dynamic scenes using point-based fusion. In Proceedings of Joint 3DIM/3DPVT Conference (3DV), pp. 8. External Links: Document Cited by: §4.2.
  • [29] T. Krajník, J. P. Fentanes, J. M. Santos, and T. Duckett (2017) Fremen: frequency map enhancement for long-term mobile robot autonomy in changing environments. IEEE Transactions on Robotics 33 (4), pp. 964–977. Cited by: §2.
  • [30] H. W. Kuhn (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §4.
  • [31] M. Lee and C. C. Fowlkes (2017-10) Space-time localization and mapping. In The IEEE International Conference on Computer Vision (ICCV), pp. 3932–3941. Cited by: §2.
  • [32] C. Liu and Y. Furukawa (2019) MASC: multi-scale affinity with sparse convolution for 3d instance segmentation. External Links: 1902.04478 Cited by: §4.2, Table 1, §4, Acknowledgments.
  • [33] K. Low (2004) Linear least-squares optimization for point-toplane icp surface registration. Technical report University of North Carolina at Chapel Hill. Cited by: §3.3.
  • [34] R. Martin-Brualla, D. Gallup, and S. M. Seitz (2015) Time-lapse mining from internet photos. ACM Transactions on Graphics (TOG) 34 (4), pp. 62. Cited by: §2.
  • [35] K. Matzen and N. Snavely (2014) Scene chronology. In Proc. European Conf. on Computer Vision (ECCV), Cited by: §2.
  • [36] R. Newcombe, D. Fox, and S. Seitz (2015) DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §2.
  • [37] D. Rebolj, Z. Pučko, N. Č. Babič, M. Bizjak, and D. Mongus (2017) Point cloud quality requirements for scan-vs-bim based automated construction progress monitoring. Automation in Construction 84, pp. 323–334. Cited by: §2.
  • [38] S. Rusinkiewicz and M. Levoy (2001-05) Efficient variants of the icp algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling, Vol. , pp. 145–152. External Links: Document, ISSN Cited by: §3.3.
  • [39] J. M. Santos, T. Krajník, and T. Duckett (2017) Spatio-temporal exploration strategies for long-term autonomy of mobile robots. Robotics and Autonomous Systems 88, pp. 116–126. Cited by: §2.
  • [40] G. Schindler, F. Dellaert, and S. B. Kang (2007-06) Inferring temporal order of images from 3d structure. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1–7. External Links: Document, ISSN 1063-6919 Cited by: §2.
  • [41] D. Schulz and W. Burgard (2001) Probabilistic state estimation of dynamic objects with a moving mobile robot. Robotics and Autonomous Systems 34 (2-3), pp. 107–115. Cited by: §2.
  • [42] Y. M. Shin, M. Cho, and K. M. Lee (2013-11) Multi-object reconstruction from dynamic scenes: an object-centered approach. Comput. Vis. Image Underst. 117 (11). External Links: Link, Document Cited by: §2.
  • [43] S. Song and J. Xiao (2013) Tracking revisited using rgbd camera: unified benchmark and baselines. In Proceedings of the IEEE international conference on computer vision, pp. 233–240. Cited by: §2.
  • [44] Y. Turkan, F. Bosche, C. T. Haas, and R. Haas (2012) Automated progress tracking using 4d schedule and 3d sensing technologies. Automation in Construction 22, pp. 414–421. Cited by: §2.
  • [45] S. Tuttas, A. Braun, A. Borrmann, and U. Stilla (2017) Acquisition and consecutive registration of photogrammetric point clouds for construction progress monitoring using a 4d bim. PFG–Journal of Photogrammetry, Remote Sensing and Geoinformation Science 85 (1), pp. 3–15. Cited by: §2.
  • [46] C. Wang, C. Thorpe, and S. Thrun (2003) Online simultaneous localization and mapping with detection and tracking of moving objects: theory and results from a ground vehicle in crowded urban areas. In Robotics and Automation, 2003. Proceedings. ICRA’03. IEEE International Conference on, Vol. 1, pp. 842–849. Cited by: §2.
  • [47] C. Wang and C. Thorpe (2002) Simultaneous localization and mapping with detection and tracking of moving objects. In Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, Vol. 3, pp. 2918–2924. Cited by: §2.
  • [48] F. Yan, A. Sharf, W. Lin, H. Huang, and B. Chen (2014-07) Proactive 3d scanning of inaccessible parts. ACM Trans. Graph. 33 (4), pp. 157:1–157:8. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [49] J. Young, V. Basile, M. Suchi, L. Kunze, N. Hawes, M. Vincze, and B. Caputo (2017) Making sense of indoor spaces using semantic web mining and situated robot perception. pp. 299–313. Cited by: §2.