The Replica Dataset: A Digital Replica of Indoor Spaces

06/13/2019 ∙ by Julian Straub, et al.

We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometrically, and semantically realistic generative models of the world, such as egocentric computer vision, semantic segmentation in 2D and 3D, geometric inference, and the development of embodied agents (virtual robots) performing navigation, instruction following, and question answering. Due to the high level of realism of the renderings from Replica, there is hope that ML systems trained on Replica may transfer directly to real world image and video data. Together with the data, we are releasing a minimal C++ SDK as a starting point for working with the Replica dataset. In addition, Replica is ‘Habitat-compatible’, i.e., it can be natively used with AI Habitat for training and testing embodied agents.


I Introduction

If the organism carries a “small scale model” of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies that face it.

Kenneth Craik [7] via Sutton and Barto [26]

Replicating real physical spaces in their full fidelity in a digital form is a longstanding goal across multiple areas in science and engineering. Digitizing real environments has many future use cases, such as virtual telepresence. The combination of replicas of real environments with powerful simulators such as AI Habitat [24] enables scalable machine learning that may yield models that can be directly deployed in the real world to perform tasks like embodied navigation [1], instruction following [2], and question answering [9]. Via parallelization, reality simulators enable faster-than-realtime and more scalable training of AI agents in comparison with training real robots in the wild. Additionally, simulation from Replica can be leveraged in egocentric computer vision, semantic segmentation in 2D and 3D and geometry inference. More realistic replicas lead to more realistic virtual telepresence, more accurate computation over them, and a smaller domain gap between simulation and reality.

Fig. 2: Replica ‘Turing Test’: One column shows the raw RGB images captured in these spaces, the other column shows renderings from Replica (from the same camera pose). Can you tell which column shows ‘real’ images and which column shows renderings? Find the answer in Sec. IV.

Datasets such as ImageNet [16], COCO [19], and VQA [3] have helped advance research in computer vision and multimodal AI problems. With the Replica dataset we aim to unlock research into AI agents and assistants that can be trained in simulation and deployed in the real world. The key distinction between Replica and these image-based static datasets is that Replica scenes allow for active perception, since the 3D assets allow generating views from anywhere inside the model. This enables the next generation of embodied AI tasks such as those studied in the AI Habitat platform [24]. Compared to other 3D datasets such as Matterport 3D [6] and ScanNet [8], Replica achieves significantly higher levels of realism. We encourage you to take the Replica Turing Test in Fig. 2. Moreover, Replica introduces high dynamic range (HDR) textures as well as renderable planar mirror and glass reflectors, as can be seen in the comparison of raw RGB capture with renders from the model in Fig. 2. The Replica dataset contains 18 scenes of various real world environments. As shown in Fig. 1, we provide a dense mesh, high resolution and HDR textures, semantic class and instance annotation of each primitive, and glass and mirror reflectors. The Replica dataset includes a variety of scene types as well as a large range of object instances from 88 semantic classes to facilitate interesting machine learning tasks. It also contains 6 scans of the same indoor space with different furniture configurations that show different snapshots in time of the same space.

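| Group | Metric | Replica | Matterport 3D (MP3D) | ScanNet | Stanford 2D-3D-S | Gibson |
|---|---|---|---|---|---|---|
| Scale | # scenes | 18 | 90 | 1513 | 6 | 572 |
| Scale | # rooms | 35 | 2056 | 707 | 270 | ? |
| Res. | color res. | 92k | 97k | 20k | MP3D | MP3D |
| Res. | geometry res. | k | k | k | MP3D | MP3D |
| Fidelity | HDR textures | | | | | |
| Fidelity | reflectors | | | | | |
| Labels | semantic classes | 88 | 40 | | 13 | - |
| Labels | semantic annotation | 3D Paint | 3D Felzenszwalb | 3D Felzenszwalb | 3D | - |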
TABLE I: Comparison of reconstruction-based 3D scene datasets. We estimate color and geometry resolution for each dataset as the number of pixels and mesh primitives, respectively, per m². Note that for all metrics we used the meshes that were semantically annotated and report median values.

Fig. 3: Renderings from comparison datasets (panels: Replica, ScanNet, Matterport 3D; modalities: RGB, geometry, normals, class segmentation, instance segmentation) to give a qualitative comparison to the Replica dataset. Note that clean geometry is important to allow rendering clean semantic class and instance segmentation. Geometry and texturing artifacts are noticeable in both Matterport 3D and ScanNet. Additionally, ScanNet scans show a lot of missing surfaces and often do not capture the full room.

II Related Work

Existing 3D datasets can be classified broadly into two categories: (1) human-generated synthetic scenes based on CAD models and (2) reconstructions of real environments. They vary in semantic and visual realism.

II-A Synthetic Scenes

SUNCG [25] is a large dataset of synthetic indoor environments. However, the scenes lack realistic appearances and are often semantically overly simplistic. SceneNet [14] is a synthetic dataset with 57 scenes and 3,699 object instances, which can be automatically varied by sampling objects of the same class and similar size to replace the base objects in the 57 scenes. The Stanford Scenes [12] dataset consists of 130 scenes with 1,723 object instances. At a smaller scale, the RobotriX dataset [13] has only 16 scenes but a more realistic appearance. The InteriorNet [17] dataset consists of 22M interior environments created from 1M CAD assets. The dataset comes with 20M images rendered out from the environments for SLAM benchmarking and machine learning. While newer synthetic datasets like InteriorNet are becoming more and more realistic, they still do not capture real spaces with all their imperfections due to use, clutter and semantic variety.

II-B Real Scenes

There exist multiple datasets of 3D reconstructions of rooms and houses that capture semantically realistic scenes, as shown in the overview in Table I. Based on Matterport’s indoor scanning system there is the Matterport3D dataset [6], the Gibson dataset [28], and the Stanford 2D-3D-S dataset [4], some of which capture hundreds of scenes. These scales are impressive for reconstruction-based 3D scene datasets, as it takes effort to collect, process, clean up and semantically annotate real data. The visual quality of the Matterport-scanner-based datasets is more realistic than SUNCG, but geometry artifacts and lighting problems exist throughout the datasets, as shown in Fig. 3.

The original Matterport3D [6] dataset consists of 90 houses with 2,056 rooms and 50,811 object instances from 40 semantic classes. Semantic annotation was performed based on a 3D Felzenszwalb pre-segmentation [11]. This means the resolution and accuracy of the semantic annotation are constrained to the segments extracted by the Felzenszwalb algorithm, which we found to be prone to inaccuracy on boundaries between objects. The Stanford 2D-3D-S dataset [4] contains 6 large-scale reconstructions with a total of 270 rooms. It is annotated with 13 object classes and 11 scene categories. The exact method of semantic annotation is not described except that it is done in 3D. The Gibson dataset [28] contains 572 buildings and includes the two aforementioned datasets. Only the meshes from the Matterport3D and the Stanford 2D-3D-S datasets contain semantic segmentations.

Beyond Matterport-scanner-based reconstructions, there is the ScanNet [8] dataset, which was obtained by scanning scenes with an iPad-based RGB-D camera system. It contains 1,513 scenes with more than 19 scene types and a flexible yet unspecified number of semantic classes. Mappings of the semantic classes to NYU v2, ModelNet, ShapeNet and WordNet exist. Semantic annotation was performed based on a Felzenszwalb segmentation, with the same downside of inaccurate segmentation boundaries as described previously.

Table I shows that while this initial release of Replica is a smaller dataset, its reconstructions have high color, geometry, and semantic resolution. Additionally, the Replica dataset introduces HDR textures and renderable reflectors.

III Dataset Creation

Fig. 4: The data collection rig used to capture the raw data used to build Replica.
Fig. 5: Example of holes filled with the mesh fix-up tool. Filled holes are marked red. In these images the textures have been tonemapped to a low dynamic range to facilitate easier human interpretation for manual touch up.

To create the Replica reconstructions, we use a custom-built RGB-D capture rig with an IR projector, depicted in Fig. 4. It collects time-aligned raw IMU, RGB, IR and wide-angle greyscale sensor data. The wide-angle greyscale video data together with the IMU data is used by an in-house SLAM system, similar to state-of-the-art systems like [10, 22], to provide 6 degree of freedom (DoF) poses. We compute raw depth from the IR video stream given the IR structured light pattern projected from the rig. Given the 6 DoF poses from the SLAM system, depth images are fused into a truncated signed distance function (TSDF) akin to KinectFusion [23]. Meshes are extracted using the standard Marching Cubes [20] algorithm, simplified via Instant Meshes [15] and textured with a PTex-like system [5]. Finally, we extract mirrors and reflective surfaces [27].
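The TSDF fusion step can be illustrated with a minimal sketch of a KinectFusion-style running weighted average along a single camera ray. This is not the Replica pipeline: the function name `fuse_depth`, the truncation distance, the unit observation weight, and the one-dimensional voxel layout are all assumptions made for illustration.

```python
import numpy as np

def fuse_depth(tsdf, weight, voxel_depth, measured_depth, trunc=0.05):
    """Fuse one depth measurement into TSDF samples along a single ray.

    tsdf, weight   : running TSDF values and weights per voxel (1D arrays)
    voxel_depth    : depth of each voxel centre along the ray, in metres
    measured_depth : depth observed by the sensor along this ray, in metres
    trunc          : truncation distance in metres (illustrative value)
    """
    sdf = measured_depth - voxel_depth        # positive in front of the surface
    visible = sdf > -trunc                    # skip voxels far behind the surface
    d = np.clip(sdf / trunc, -1.0, 1.0)       # truncate and normalise to [-1, 1]
    new_weight = weight + visible             # each observation has weight 1
    fused = np.where(
        visible,
        (tsdf * weight + d) / np.maximum(new_weight, 1),  # running average
        tsdf,                                              # unobserved: unchanged
    )
    return fused, new_weight

voxel_depth = np.linspace(0.9, 1.1, 5)        # voxels at 0.9, 0.95, 1.0, 1.05, 1.1 m
tsdf = np.zeros(5)
weight = np.zeros(5)
for z in (0.99, 1.01):                        # two noisy measurements near 1 m
    tsdf, weight = fuse_depth(tsdf, weight, voxel_depth, z)
```

After both measurements the fused TSDF crosses zero at the voxel located at 1.0 m, the mean of the two observations. A surface extractor such as Marching Cubes places the mesh at this zero crossing.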

HDR textures are obtained by cycling the exposure times of the RGB texture camera and, using the 6 DoF SLAM poses, fusing the measured radiance per texel into 16 bit floating point RGB values. This approach yields an overall dynamic range of about 85,000:1 which corresponds to more than 16 f-stops as opposed to the standard vertex mesh colors and textures of the other datasets which are encoded as 8 bit RGB values.
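The relation between a contrast ratio and f-stops can be checked with a short calculation: one f-stop is a doubling of radiance, so the number of stops is the base-2 logarithm of the ratio. The sketch below assumes, for the 8 bit comparison, a linear encoding whose largest and smallest nonzero codes differ by a factor of 255; gamma-encoded textures would span a somewhat different range.

```python
import math

def f_stops(contrast_ratio):
    """Number of photographic stops (doublings) spanned by a contrast ratio."""
    return math.log2(contrast_ratio)

hdr_stops = f_stops(85_000)   # dynamic range reported for the Replica textures
ldr_stops = f_stops(255)      # assumed ratio for linear 8 bit values (255:1)
```

Under these assumptions the HDR textures span between 16 and 17 stops, while linear 8 bit values span just under 8.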

III-A Mesh and Reflector Fixing

To ensure the highest quality 3D meshes, we manually fix planar reflective surfaces and small holes where surfaces were not sufficiently captured during scanning. Reflective surfaces are defined as planar polygons and can be annotated in our custom built software tool by specifying the boundary of the reflector on the mesh. For hole filling we first automatically detect holes by searching for boundary edges that form closed cycles and hence constitute holes. A human annotator can then use our tool to select a hole and automatically fill it using the approach described by Liepa [18]. Specifically, we use CGAL [21] to triangulate the hole boundary to generate an initial patch, then refine and smooth the patch. Examples of patched holes are shown in Fig. 5.
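The automatic hole detection described above can be sketched as follows. This is an illustrative reimplementation, not the tool used for Replica: the function name `find_holes` is hypothetical, and the sketch assumes consistently oriented faces and at most one boundary loop through each vertex. On a mesh that is open at its outer border, that border is also reported as a closed cycle.

```python
def find_holes(faces):
    """Return hole boundaries of a polygon mesh as lists of vertex ids.

    faces: iterable of vertex-id tuples (triangles or quads), consistently
    oriented. A boundary edge is a directed edge whose reverse is used by
    no face.
    """
    directed = set()
    for face in faces:
        for i, a in enumerate(face):
            directed.add((a, face[(i + 1) % len(face)]))
    # Boundary edges have no twin; map each start vertex to its end vertex.
    next_vertex = {a: b for (a, b) in directed if (b, a) not in directed}

    holes = []
    visited = set()
    for start in next_vertex:
        if start in visited:
            continue
        loop, v = [], start
        while v in next_vertex and v not in visited:
            visited.add(v)
            loop.append(v)
            v = next_vertex[v]
        if v == start:              # walked back to the start: a closed cycle
            holes.append(loop)
    return holes

# A closed tetrahedron has no boundary edges; removing one face opens a hole.
closed = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (0, 2, 3)]
no_holes = find_holes(closed)
one_hole = find_holes(closed[:3])
```

Each returned loop is the boundary that a filling method such as the one by Liepa [18] would triangulate.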

III-B Semantic Annotation

Semantic annotation is performed in two steps. First, we render a set of images from the mesh such that all primitives of the mesh are observed at least once. These images are then annotated in parallel using a 2D instance-level masking tool. After 2D annotation, we fuse the 2D semantic annotations back onto the mesh using a voting scheme. The 3D annotations are then refined using a superpixel-like segmentation. This ensures that small holes in the initial fused segmentation are filled based on neighborhood information. In the second step we review, refine and correct the fused segmentation using a 3D annotation tool that in effect allows painting on the 3D mesh. This step ensures highest annotation quality since annotations can be refined down to the primitive level.
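The fusion of 2D annotations back onto the mesh can be illustrated with a simple per-primitive majority vote. The text above does not specify the details of the voting scheme, so the rule used here, the function name `fuse_labels`, and the input format are assumptions.

```python
from collections import Counter, defaultdict

def fuse_labels(observations):
    """Majority vote over 2D annotations projected onto mesh primitives.

    observations: iterable of (primitive_id, label) pairs, one per annotated
    pixel that a rendered view maps back to a primitive.
    Returns {primitive_id: winning label}.
    """
    votes = defaultdict(Counter)
    for primitive_id, label in observations:
        votes[primitive_id][label] += 1
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}

observations = [
    (0, "chair"), (0, "chair"), (0, "floor"),   # primitive 0 seen in three pixels
    (1, "floor"), (1, "floor"),
    (2, "book"),
]
fused = fuse_labels(observations)
```

Primitives that receive no votes are absent from the result; these correspond to the small holes that the subsequent superpixel-like refinement fills from neighborhood information.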

As part of the semantic annotation we also annotate areas that need to be anonymized (i.e. blurred or pixelated) to ensure privacy.

We represent the semantic annotation as a multi-tree or forest data structure which we call a segmentation forest: At the bottom level are the individual primitives of the mesh. The next level connects primitives into larger segments. At the root level these segments are connected into semantic object entities. Figure 6 shows a simple example comprised of a chair and two book instances. As can be seen, the segmentation forest data structure represents an instance segmentation of the scene where each tree in the semantic annotation forest corresponds to a semantic instance. A class segmentation is obtained by simply rendering all instances of the same class in the same color. The segmentation forest data structure is flexible in that it allows connecting semantic instances in a hierarchical way. Rendering at different levels of the forest leads to different segmentations of the scene.

Fig. 6: In the proposed segmentation forest data structure, the root of each tree indicates the semantic object instance. The mesh primitives from the leaf nodes (denoted “p”) are connected into segmentation nodes (denoted “seg”) one level below the roots.
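A minimal sketch of a segmentation forest shows how instance and class segmentations follow from the same structure. The `Node` class, the helper names, and the primitive ids are hypothetical and chosen for illustration; they do not mirror the data structures of the Replica SDK.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                         # class name at a root, "seg" below it
    children: list = field(default_factory=list)
    primitives: list = field(default_factory=list)    # primitive ids at the lowest level

def collect_primitives(node):
    """All primitive ids in the subtree rooted at node."""
    ids = list(node.primitives)
    for child in node.children:
        ids.extend(collect_primitives(child))
    return ids

def instance_segmentation(forest):
    """Map primitive id -> instance index (one instance per tree)."""
    return {p: i for i, root in enumerate(forest) for p in collect_primitives(root)}

def class_segmentation(forest):
    """Map primitive id -> class name (instances of one class share a label)."""
    return {p: root.name for root in forest for p in collect_primitives(root)}

# Toy forest: one chair and two books; the primitive ids are made up.
forest = [
    Node("chair", children=[Node("seg", primitives=[0, 1]), Node("seg", primitives=[2])]),
    Node("book", children=[Node("seg", primitives=[3, 4])]),
    Node("book", children=[Node("seg", primitives=[5])]),
]
instances = instance_segmentation(forest)
classes = class_segmentation(forest)
```

The two books receive different instance indices but the same class label, which is the difference between rendering at the root level and rendering by class.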

IV Dataset Description

(a) apartment 0
(b) apartment 1
(c) apartment 2
(d) office 0
(e) office 1
(f) office 2
(g) office 3
(h) office 4
(i) room 0
(j) room 1
(k) room 2
(l) hotel 0
Fig. 7: The Replica dataset contains a variety of 12 semantically different reconstructions.
(a) FRL apartment 0
(b) FRL apartment 1
(c) FRL apartment 2
(d) FRL apartment 3
(e) FRL apartment 4
(f) FRL apartment 5
Fig. 8: The Replica dataset contains a set of 6 scenes of the FRL apartment with the contents rearranged, mimicking the same scene at different points in time.
Fig. 9: Example renderings from the Replica dataset showing glass and mirror reflectors as well as high resolution textures.

The Replica dataset, together with a minimal SDK, is published at the following GitHub repository: https://github.com/facebookresearch/Replica-Dataset.

As shown in Figs. 7 and 8, the Replica dataset contains 18 different scenes: 6 different setups of the FRL apartment, 5 office rooms, a 2-floor house, 2 multi-room apartment spaces, a hotel room, and 3 rooms of apartments. The scenes were selected with an eye towards semantic variety of the environments as well as their scale. With the 6 FRL apartment scenes with different setups, we introduce a dataset of scenes taken at different points in time of the same space.

Each Replica scene contains dense geometry, high resolution HDR textures, reflectors and semantic class and instance annotation as shown for one of the datasets in Fig. 1. Figure 3 shows renderings from the FRL Apartment dataset for the different modalities. Note the high fidelity of the semantic annotations and the accuracy at borders.

As shown in Fig. 9, glass and mirror surface information is contained in the Replica dataset and can be rendered for additional realism and photometric accuracy.

In Fig. 2 we show comparisons of the raw RGB image captured from the data collection rig next to a rendering of the scene from the same pose. Qualitatively, it is hard to tell whether the left or right frames are the raw captures, underscoring the realism of the Replica reconstructions. Small artifacts and the fact that there is no motion blur give away that the right column shows the rendered images. Additionally, the foot of the operator is accidentally captured in the second example, giving another hint that the left column contains the raw captured images.

Figure 10 shows a histogram over semantic instances across the dataset. The semantic classes were picked to capture the variety of objects and surface classes in Replica. The figure shows that common structural elements such as “floor”, “wall”, “ceiling” as well as various object types from “chair” to “book” and small entities such as “wall_plug”, “cup”, and “coaster” are included. While the number of classes is larger than in several common datasets, a mapping to other class lists is straightforward.

Fig. 10: Histogram over the 88 semantic classes contained in the dataset.

We publish a minimal Replica C++ SDK with the dataset that demonstrates how to render the Replica reconstructions. The SDK may be used to inspect the dataset and as a starting point for further development. For machine learning applications we recommend the use of the AI Habitat [24] simulator, which integrates with PyTorch and allows rendering from Replica directly into PyTorch Tensors for deep learning. The AI Habitat simulator supports rendering RGB, depth, semantic instance and semantic class segmentation images at up to 10k frames per second.

IV-A Data Organization

Each Replica dataset scene contains the following data:

  • mesh.ply: quad mesh encoding the dense surface of the scene. Each vertex has a color value assigned to it for low resolution and non-HDR rendering of the scene (not recommended).

  • textures/*: high dynamic range PTex texture files.

  • glass.sur: file describing reflectors in the scene. It contains a list of reflector parameter objects. Each reflector is described by the transformation from world coordinates to the reflector plane, a polygon in the reflector plane, a surface normal and the reflectance value. A reflectance of 1 signals a mirror and anything else a partially transparent glass surface.

  • semantic.json and semantic.bin: semantic segmentation of the reconstruction.

  • preseg.json and preseg.bin: planar/non-planar segmentation of the reconstruction.

  • habitat: data exported for use with AI Habitat.

    • mesh_semantic.ply: quad mesh with semantic instance ids for each primitive. The class of each instance can be looked up in the semantic.json file in the habitat folder.

    • mesh_semantic.navmesh: occupancy information needed for AI Habitat agent simulation.

    • semantic.json: mapping from a semantic instance id stored with every primitive in mesh_semantic.ply to the semantic class name.

The semantic.json and the preseg.json files represent a segmentation forest data structure by specifying a list of nodes with class names, a list of children and a parent field. Each node has a unique id and is addressed via this id. The corresponding semantic.bin and preseg.bin files contain the list of primitive ids corresponding to each node.
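A flat, id-addressed node list of this kind can be traversed as sketched below. The key names (`nodes`, `id`, `class`, `parent`, `children`) and the example content are hypothetical; the actual layout of semantic.json and preseg.json should be checked against the released files before relying on it.

```python
import json

# Hypothetical file content: key names and values are illustrative only.
text = """
{"nodes": [
  {"id": 0, "class": "chair", "parent": null, "children": [1, 2]},
  {"id": 1, "class": "seg",   "parent": 0,    "children": []},
  {"id": 2, "class": "seg",   "parent": 0,    "children": []},
  {"id": 3, "class": "book",  "parent": null, "children": [4]},
  {"id": 4, "class": "seg",   "parent": 3,    "children": []}
]}
"""

# Index the node list by id so that nodes can be addressed directly.
nodes = {node["id"]: node for node in json.loads(text)["nodes"]}

def root_of(node_id):
    """Follow parent fields up to the root, i.e. the semantic instance."""
    while nodes[node_id]["parent"] is not None:
        node_id = nodes[node_id]["parent"]
    return node_id

def is_consistent():
    """Check that every child listed by a node names that node as its parent."""
    return all(
        nodes[child]["parent"] == node_id
        for node_id, node in nodes.items()
        for child in node["children"]
    )

instance_of_node = {node_id: root_of(node_id) for node_id in nodes}
```

Combining `instance_of_node` with the per-node primitive lists from the corresponding .bin file would give an instance id for every primitive.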

V Conclusion

The Replica dataset sets a new standard for texture, geometry and semantic resolution as well as quality for reconstruction-based 3D datasets. It introduces HDR textures and renderable reflector information. As such, it enables AI agent and ML research that needs access to data beyond static datasets consisting of collections of images such as ImageNet and COCO. Furthermore, due to its realism, it can serve as a generative model for benchmarking 3D perception systems such as SLAM and dense reconstruction systems, and it can facilitate research into AR and VR telepresence.

Acknowledgments

The Replica dataset would not have been possible without the hard work and contributions of Matthew Banks, Christopher Dotson, Rashad Barber, Justin Blosch, Ethan Henderson, Kelley Greene, Michael Thot, Matthew Winterscheid, Robert Johnston, Abhijit Kulkarni, Robert Meeker, Jamie Palacios, Tony Phan, Tim Petrvalsky, Sayed Farhad Sadat, Manuel Santana, Suruj Singh, Swati Agrawal, and Hannah Woolums.

References

  • [1] Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir Roshan Zamir. On evaluation of embodied navigation agents. arXiv:1807.06757, 2018.
  • [2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
  • [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015.
  • [4] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016.
  • [5] Brent Burley and Dylan Lacewell. Ptex: Per-face texture mapping for production rendering. In Computer Graphics Forum, volume 27, pages 1155–1164. Wiley Online Library, 2008.
  • [6] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017. https://niessner.github.io/Matterport/.
  • [7] Kenneth J. W. Craik. The Nature of Explanation. Cambridge University Press, 1943.
  • [8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017. http://www.scan-net.org/.
  • [9] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In CVPR, 2018.
  • [10] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. TPAMI, 40(3):611–625, 2017.
  • [11] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181, 2004.
  • [12] Matthew Fisher, Daniel Ritchie, Manolis Savva, Thomas Funkhouser, and Pat Hanrahan. Example-based synthesis of 3D object arrangements. In ACM SIGGRAPH Asia, 2012.
  • [13] Alberto Garcia-Garcia, Pablo Martinez-Gonzalez, Sergiu Oprea, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Alvaro Jover-Alvarez. The robotrix: An extremely photorealistic and very-large-scale indoor dataset of sequences with robot trajectories and interactions. In IROS, pages 6790–6797. IEEE, 2018.
  • [14] A Handa, V Patraucean, V Badrinarayanan, S Stent, and R Cipolla. Scenenet: understanding real world indoor scenes with synthetic data. arXiv preprint arXiv:1511.07041, 2015.
  • [15] Wenzel Jakob, Marco Tarini, Daniele Panozzo, and Olga Sorkine-Hornung. Instant field-aligned meshes. ACM Transactions on Graphics, 34(6), November 2015.
  • [16] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [17] Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, and Stefan Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. In BMVC, 2018.
  • [18] Peter Liepa. Filling holes in meshes. In ACM SIGGRAPH Symposium on Geometry Processing, pages 200–205, 2003.
  • [19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [20] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’87, pages 163–169, New York, NY, USA, 1987. ACM.
  • [21] Sébastien Loriot, Jane Tournois, and Ilker O. Yaz. Polygon mesh processing. In CGAL User and Reference Manual. CGAL Editorial Board, 4.14 edition, 2019.
  • [22] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. TRO, 31(5):1147–1163, 2015.
  • [23] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 IEEE International Symposium on Mixed and Augmented Reality, pages 127–136. IEEE, 2011.
  • [24] Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. arXiv preprint arXiv:1904.01201, 2019.
  • [25] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
  • [26] R. S. Sutton and A. G. Barto. An adaptive network that constructs and uses an internal model of its world. Cognition and Brain Theory, 1981.
  • [27] Thomas Whelan, Michael Goesele, Steven J. Lovegrove, Julian Straub, Simon Green, Richard Szeliski, Steven Butterfield, Shobhit Verma, and Richard Newcombe. Reconstructing scenes with mirror and glass surfaces. ACM Transactions on Graphics (TOG), 37(4):102, 2018.
  • [28] Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In CVPR, 2018. http://gibsonenv.stanford.edu/database/.