Efficient 3D Reconstruction and Streaming for Group-Scale Multi-Client Live Telepresence

Patrick Stotko et al., University of Bonn, 08/08/2019

Sharing live telepresence experiences for teleconferencing or remote collaboration receives increasing interest with the recent progress in capturing and AR/VR technology. Whereas impressive telepresence systems have been proposed on top of on-the-fly scene capture, data transmission and visualization, these systems are restricted to the immersion of a single user or a small number of users into the respective scenarios. In this paper, we direct our attention to immersing significantly larger groups of people into live-captured scenes as required in education, entertainment or collaboration scenarios. For this purpose, rather than abandoning previous approaches, we present a range of optimizations of the involved reconstruction and streaming components that allow the immersion of a group of more than 24 users within the same scene - which is about a factor of 6 higher than in previous work - without introducing further latency or changing the involved consumer hardware setup. We demonstrate that our optimized system is capable of generating high-quality scene reconstructions as well as providing an immersive viewing experience to a large group of people within these live-captured scenes.

1 Related Work

Telepresence applications for sharing live experiences rely on real-time 3D scene capture. For this purpose, the underlying scene representation, into which the incoming sensor data are fused, is of particular importance. Well-established representations include surface modeling in the form of implicit truncated signed distance fields (TSDFs). Early real-time volumetric reconstruction approaches [9, 21] are based on storing the scene model in a uniform grid. This results in high memory requirements as the data structure is not adapted to the local presence of a surface. To improve the scalability to large-scale scenes, further work exploited the sparsity of the TSDF representation, e.g. based on moving volume techniques [26, 31], representing scenes in terms of blocks of volumes that follow dominant planes [8], or storing TSDF values only near the actual surface areas [1, 23, 12]. The individual blocks can be managed using tree structures or hash maps as proposed by Nießner et al. [23] and respective optimizations [12, 13, 25]. Furthermore, the replacement of the TSDF representation by a high-resolution binary voxel grid has been considered by Reichl et al. [25] to improve the scalability and reduce the memory requirements. Recent extensions include the detection of loop closures [11, 2, 16] to reduce drift artifacts in camera localization as well as multi-client collaborative acquisition and reconstruction of static scenes [6].

This progress in real-time capturing enabled the development of various telepresence applications. Early telepresence systems [9, 17, 18, 19, 10, 5] were designed for room-scale environments and faced the problems of limited reconstruction quality due to high sensor noise and reduced resolution. Relying on an expensive capturing setup with several cameras, GPUs and desktop computers, the Holoportation system [24] was designed for high-quality real-time reconstruction of a dynamic room-scale environment based on the Fusion4D system [3] as well as real-time data transmission. This has been complemented with AR/VR systems to allow immersive end-to-end teleconferencing. In contrast, interactive telepresence for individual remote users within live-captured static scenes has been addressed by Mossel and Kröter [20] based on voxel block hashing [23, 12]. The limitations of this system, i.e. high bandwidth requirements, the immersion of only a single remote user into the captured scenarios, and the loss of scene parts reconstructed during network interruptions, have been overcome in the recent SLAMCast system [27]. However, the scalability to immersing large groups of people into on-the-fly captured scenes has not been achieved so far. In this paper, we directly address this problem by several modifications to the major components involved in telepresence systems.

2 System Outline

Akin to previous work, we build our scalable multi-client telepresence system on top of a volumetric scene representation in terms of voxel blocks, i.e. small blocks of voxels. This approach has been well-established by previous investigations in the context of real-time reconstruction [9, 21, 30, 23, 1, 31, 22, 12, 13, 11, 2] and telepresence [20, 24, 27]. As shown in Figure 1, current state-of-the-art telepresence systems involving live-captured scenarios rely on three core components: (1) a real-time 3D reconstruction process, (2) a central server process, and (3) exploration clients. RGB-D images captured by a single camera are streamed to the reconstruction process, which runs on a cloud server and performs on-the-fly camera localization and scene capture via volumetric fusion. The reconstructed scene data are then passed to the central server process that manages a bandwidth-optimized version of the global model as well as the streaming of these data according to requests by connected exploration clients. Each exploration client integrates the transmitted scene parts into a locally generated mesh that can be interactively explored with VR devices on the respective local computer. In the following, we focus on extending such live telepresence systems to the immersion of larger groups of remote users into a live-captured scene, using the SLAMCast system [27] as an example. This requires the optimization of the reconstruction process (see section 3) and the central server process (see section 4). In contrast, the exploration client receives the compressed and optimized scene representation and is already capable of providing an immersive viewing experience at the remote user's site.

3 Optimization of the 3D Reconstruction Process

Figure 2: General volumetric 3D reconstruction pipeline. Our set of efficient filters designed to improve the performance and scalability of state-of-the-art live telepresence systems can also be applied to the components of standalone 3D reconstruction (highlighted).

Since our optimizations are not particularly restricted to the reconstruction process used in the SLAMCast system, we show their application to volumetric 3D reconstruction approaches in general and provide an overview of the respective pipeline (see Figure 2). Here, the surface is represented in terms of implicit truncated signed distance fields (TSDFs) and stored as a sparse unordered set of voxel blocks using spatial hashing [23, 12, 11, 2, 25]. Input to the reconstruction pipeline is an incremental stream of RGB-D images which is processed in an online fashion. First, the current RGB-D frame is preprocessed: camera-specific distortion effects are removed and a normal map is computed from the depth data. Afterwards, the current camera pose is estimated either using frame-to-model tracking [9, 21, 23, 12, 29, 31, 25] (as also used in the SLAMCast system) or using bundle adjustment for globally-consistent reconstruction [11, 2]. Using this pose, non-visible voxel block data are streamed out to CPU memory whereas visible blocks in CPU memory are streamed back into GPU memory [23, 12, 25, 16]. In the next step, new voxel blocks are allocated in the volume and the RGB-D data are fused into the volumetric model. Finally, a novel view depicting the current state of the reconstruction is generated using raycasting to provide live feedback to the user during capturing.

3.1 Image Preprocessing

We improve the robustness of the acquired RGB-D data by filtering potentially unreliable samples from the depth map. A further benefit of this operation is a more compact scene model representation. Inspired by previous work [31], we discard samples located on stark depth discontinuities by considering the deviations to the depth values in a local neighborhood. Due to the limited resolution and the overall noise characteristics of the sensor, such samples are likely to be outliers and might largely deviate from the true depth values. We extend this filter by further discarding samples with a significant amount of missing data in their local neighborhood. In such regions, which may not only contain depth discontinuities, the depth measurements are also likely to be unreliable. Thus, we consider the set

$$\mathcal{O} = \Big\{\, p \;\Big|\; \max_{q \in \mathcal{N}(p)} \big|D(p) - D(q)\big| > \theta_D \ \vee\ \frac{|\mathcal{N}_\emptyset(p)|}{|\mathcal{N}(p)|} > \theta_f \,\Big\} \tag{1}$$

as outliers, where $\theta_D$ and $\theta_f$ are user-defined thresholds, $D$ denotes the depth map, $\mathcal{N}(p)$ the neighborhood of the depth sample $p$, and $\mathcal{N}_\emptyset(p)$ the set of neighboring pixels with no valid depth data. These outliers affect the overall reconstruction quality as well as the model compactness.
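The following minimal Python/NumPy sketch illustrates this filter. The function name, the window parameter radius and the concrete threshold names theta_d/theta_f are our own choices for illustration and are not taken from the original implementation.

```python
import numpy as np

def filter_depth_map(depth, theta_d=0.2, theta_f=0.25, radius=2):
    """Discard depth samples at stark discontinuities or with many invalid neighbors.

    depth   -- H x W depth map in meters, 0 marks missing measurements
    theta_d -- maximum allowed deviation to a neighboring depth value [m]
    theta_f -- maximum allowed fraction of invalid pixels in the neighborhood
    radius  -- half size of the square neighborhood window
    """
    h, w = depth.shape
    filtered = depth.copy()
    for y in range(h):
        for x in range(w):
            d = depth[y, x]
            if d == 0.0:
                continue  # already invalid
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            window = depth[y0:y1, x0:x1]
            valid = window[window > 0.0]
            invalid_fraction = 1.0 - valid.size / window.size
            max_deviation = np.max(np.abs(valid - d))
            # Eq. (1): discard samples on stark depth discontinuities or with
            # a significant amount of missing data in their local neighborhood.
            if max_deviation > theta_d or invalid_fraction > theta_f:
                filtered[y, x] = 0.0
    return filtered

# Example usage on a synthetic depth map with a block of missing measurements.
depth = np.full((240, 320), 2.0)
depth[100:120, 150:170] = 0.0
clean = filter_depth_map(depth)
```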

3.2 Data Fusion

Although potentially unreliable data around stark depth discontinuities have been filtered out during the preprocessing step, there are still samples, e.g. around small discontinuities, that do not contribute to the reconstruction and negatively affect the model compactness and streaming performance. In the voxel block allocation step, these unreliable data unnecessarily enlarge the global truncation region around the unknown surface since all voxel blocks located within the local truncation region around the respective depth samples are considered during allocation. Traditional approaches try to remove these blocks afterwards using garbage collection [23], which requires a costly analysis of the voxel data. In contrast, we propose a novel implicit filter which reduces the amount of unnecessary block allocations. By considering only every $k$-th pixel per column and row, where $k$ is a user-defined control parameter, the depth image is virtually downsampled and the likelihood of an over-sized global truncation region is significantly reduced. Furthermore, this reduces the number of processed voxels during data fusion, which greatly speeds up the reconstruction and reduces the amount of blocks that are later queued for streaming to the server. Note that this downsampling is only performed during allocation whereas the whole depth image is still used for data fusion to employ TSDF-based regularization. In the context of globally-consistent 3D reconstruction using bundle-adjusted submaps [11, 6], our filter improves the compactness of the respective submap into which the RGB-D data are fused, whereas the fusion of the more compact submaps into a single global model would be performed as in previous work.
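For illustration, the sketch below shows allocation-only downsampling in Python/NumPy. The block size, the ray sampling density and the helper names are assumptions made for this example and do not reflect the actual GPU implementation; only the stride-k traversal is the point being demonstrated.

```python
import numpy as np

BLOCK_SIZE = 8        # voxels per block edge (assumed, typical for voxel block hashing)
VOXEL_SIZE = 0.005    # 5 mm voxels, as used in the evaluation
TRUNCATION = 0.06     # 60 mm truncation region

def blocks_to_allocate(depth, intrinsics, cam_to_world, k=4):
    """Collect voxel block coordinates touched by the local truncation region,
    considering only every k-th pixel per row and column (Section 3.2).
    The full-resolution depth map is still used afterwards for TSDF fusion."""
    fx, fy, cx, cy = intrinsics
    blocks = set()
    for y in range(0, depth.shape[0], k):       # virtual downsampling: stride k
        for x in range(0, depth.shape[1], k):
            d = depth[y, x]
            if d == 0.0:
                continue
            # Sample points along the viewing ray within the truncation region.
            for z in np.linspace(d - TRUNCATION, d + TRUNCATION, 5):
                p_cam = np.array([(x - cx) * z / fx, (y - cy) * z / fy, z, 1.0])
                p_world = cam_to_world @ p_cam
                block = tuple(np.floor(p_world[:3] / (BLOCK_SIZE * VOXEL_SIZE)).astype(int))
                blocks.add(block)
    return blocks

# Example usage: identity camera pose and simple pinhole intrinsics.
depth = np.full((240, 320), 1.5)
blocks = blocks_to_allocate(depth, intrinsics=(525.0, 525.0, 160.0, 120.0),
                            cam_to_world=np.eye(4), k=4)
```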

3.3 Model Visualization

In order to provide a decent live preview of the current model state, the generation of such model views should preserve all relevant scene information while suppressing noise as much as possible. Furthermore, if frame-to-model tracking is used to estimate the current camera pose, this is also crucial for robust alignment. We propose a Marching Cubes (MC) voxel block pruning approach, which has been designed for the central server process and will be described in more detail in section 4. Here, we show an adaptation of this contribution to standalone volumetric 3D reconstruction where the model is stored implicitly using TSDF voxels. Each TSDF voxel stores a TSDF value and a fusion weight $W$ (both compressed using 16-bit linear encoding [12]) as well as a 24-bit color. Inspired by the garbage collection approach of point-based reconstruction techniques [14], we ignore TSDF voxels that are currently considered unstable during raycasting and triangle generation. These voxels contain only very few, possibly unreliable observations from the input data, i.e. their fusion weight falls below a user-defined threshold $\theta_W$:

$$\mathcal{V}_{\mathrm{unstable}} = \big\{\, v \;\big|\; W(v) < \theta_W \,\big\} \tag{2}$$

However, in contrast to previous garbage collection approaches [23, 14], we do not remove these voxel blocks but only ignore them. This avoids accidental removal of blocks that might become stable at a future time when this scene part is also partially stored in a different submap or revisited by the user or another client in multi-client acquisition setups [11, 6]. Furthermore, by ignoring unstable data, the raycasted view will also be consistent with the exploration client’s version of the 3D model.
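A compact sketch of this stability test is given below. The record layout is a simplified stand-in for the compressed voxel format described above, the threshold name theta_w follows Eq. (2), and the default value 2.0 merely mirrors the setting used later in the evaluation.

```python
from dataclasses import dataclass

@dataclass
class TSDFVoxel:
    tsdf: float                # truncated signed distance (16-bit encoded in the real system)
    weight: float              # fusion weight, i.e. confidence from integrated observations
    color: tuple = (0, 0, 0)   # 24-bit RGB color

def is_stable(voxel: TSDFVoxel, theta_w: float = 2.0) -> bool:
    """Eq. (2): a voxel is unstable while its fusion weight is below theta_w.
    Unstable voxels are ignored (not deleted) during raycasting and
    Marching Cubes triangle generation."""
    return voxel.weight >= theta_w

# Example: only stable voxels contribute to the live preview and the mesh.
voxels = [TSDFVoxel(0.01, 5.0, (200, 180, 160)), TSDFVoxel(-0.02, 1.0, (90, 90, 90))]
renderable = [v for v in voxels if is_stable(v)]
```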

4 Optimization of the Central Server Process

Input: received TSDF voxel block positions and voxel data
Output: voxel block position list for updating the stream sets
Algorithm 1: Our optimized server voxel block data integration (the individual steps are described in the text below)

Beyond optimizations in the reconstruction process, the scalability of a live telepresence system also relies on the optimization of its central server process, which manages the reconstructed global scene model as well as the stream states and requests of connected exploration clients. In this regard, we show respective optimizations using the recently published SLAMCast system [27] as an example. In comparison to the standard voxel block data integration at the server side, we propose a further filtering step which discards empty or unstable voxel blocks that contain only very few or no observations from the input RGB-D image data. This significantly improves the streaming performance and scalability and allows the immersion of groups of people. The individual steps of our optimized integration approach are shown in Algorithm 1.

Similar to the original SLAMCast system, we first integrate the received TSDF voxel block positions and voxel data into the global TSDF voxel block model of the central server process. Afterwards, we update the global MC voxel block model, which is optimized for streaming and stores a Marching Cubes index as well as a 24-bit color in each MC voxel. For this purpose, we create the set of MC voxel block positions requiring an update as well as a set of flags and the respective MC voxel data by performing the Marching Cubes algorithm on the corresponding TSDF voxels [15]. The flags indicate whether a block will generate reliable triangles and are constructed by analyzing the Marching Cubes indices of the MC voxels as well as the fusion weights of the corresponding TSDF voxels. Therefore, a voxel belongs to the following set if it either does not contain surface information in terms of triangles or would generate unstable triangle data:

$$\mathcal{V}_{\mathrm{empty}} = \big\{\, v \;\big|\; I_{\mathrm{MC}}(v) \in \{0, 255\} \ \vee\ W(v) < \theta_W \,\big\} \tag{3}$$

where $I_{\mathrm{MC}}(v)$ denotes the Marching Cubes index of the MC voxel $v$ and $W(v)$ the fusion weight of the corresponding TSDF voxel.

We only allocate those blocks in the MC voxel block model that are flagged and prune blocks that are currently not flagged. This minimizes the amount of scene data that are streamed to the exploration clients. Finally, we integrate the generated MC voxel data of the flagged blocks. We do not prune the TSDF voxel block model which would otherwise lead to potential artifacts, i.e. missing geometry at block boundaries, since currently empty blocks might be needed for future updates.

In contrast to the MC voxel block model, pruning the list of updated MC voxel block positions in the same way would introduce artifacts at the exploration clients since they may already have received a previous version of blocks that have been pruned in the meantime. To properly handle such updates, we manage an update set containing all voxel block positions that were considered for streaming in the past. We generate the list of updated MC voxel blocks by only considering those which generated triangles either in the past or with the current update. Finally, after the MC voxel blocks have been integrated into the volume and the list of updated block positions has been generated, we update the set by inserting all currently integrated voxel block positions.
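The following Python sketch mirrors this integration logic (cf. Algorithm 1). The container layout, the run_marching_cubes callback and the server object are simplifications we introduce for illustration and are not the actual SLAMCast data structures.

```python
from types import SimpleNamespace

def integrate_blocks(server, positions, tsdf_blocks, run_marching_cubes, theta_w=2.0):
    """Server-side voxel block integration (cf. Algorithm 1).

    server.tsdf_model -- dict: block position -> TSDF voxel data (never pruned)
    server.mc_model   -- dict: block position -> MC voxel data (pruned when empty/unstable)
    server.update_set -- set of block positions ever considered for streaming
    run_marching_cubes(voxels, theta_w) -> (mc_voxels, has_reliable_triangles)
    Returns the list of block positions used to update the clients' stream sets.
    """
    updated = []
    for pos, voxels in zip(positions, tsdf_blocks):
        # 1. Integrate into the global TSDF model; it is never pruned to avoid
        #    missing geometry at block boundaries.
        server.tsdf_model[pos] = voxels

        # 2. Per-block Marching Cubes; the flag follows Eq. (3) and indicates
        #    whether the block generates reliable triangles.
        mc_voxels, reliable = run_marching_cubes(voxels, theta_w)

        # 3. Allocate flagged blocks in the MC model, prune unflagged ones.
        if reliable:
            server.mc_model[pos] = mc_voxels
        else:
            server.mc_model.pop(pos, None)

        # 4. A block is queued for client updates if it generates triangles now
        #    or was already streamed before (so clients can replace stale copies).
        if reliable or pos in server.update_set:
            updated.append(pos)

    # 5. Remember all block positions that were considered for streaming.
    server.update_set.update(updated)
    return updated

# Example usage with plain Python containers and a dummy Marching Cubes step.
server = SimpleNamespace(tsdf_model={}, mc_model={}, update_set=set())
dummy_mc = lambda voxels, theta_w: (voxels, bool(voxels))
updated = integrate_blocks(server, [(0, 0, 0)], [[0.1, 0.2]], dummy_mc)
```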

5 Evaluation

We tested our highly scalable telepresence system on a variety of different datasets and analyzed several aspects such as system scalability, streaming latency and visual quality. For a quantitative comparison of the proposed contributions, we considered the following variants of our system:

  • Base (B): Our 3D reconstruction and streaming system with deactivated filtering contributions, yielding equivalent performance to SLAMCast [27].

  • Base + Depth Discontinuity Filter (B+DDF): The base approach with additional depth map filtering at discontinuities with $\theta_f$ = 0.25 and $\theta_D$ = 0.2 m (see subsection 3.1).

  • Base + Voxel Block Allocation Downsampling (B+VBAD): The base approach with an additional virtual downsampling at the voxel block allocation stage with $k$ = 4 (see subsection 3.2).

  • Base + MC Voxel Block Pruning (B+MCVBP): The base approach with an additional pruning of empty MC voxel blocks at the server side with $\theta_W$ = 2.0 (see subsection 3.3 and section 4).

  • Ours: Our approach incorporating all filtering contributions.

The filter sizes and thresholds described above were determined empirically using several datasets. For validation, we used different real-world datasets recorded with an ASUS Xtion Pro (lounge, copyroom) [32] and a Kinect v2 (heating_room, pool) [27] as well as synthetic data (lr kt2 with simulated noise) [7]. Throughout the experiments, we used three computers, each taking the role of one part of the telepresence system, i.e. the 3D reconstruction process (RC), the central server process (S) and the exploration client (EC). All computers were equipped with an Intel Core i7-4930K CPU, 32 GB of RAM and an NVIDIA GTX 1080 GPU with 8 GB of VRAM, and were connected via a local network. We replaced the exploration client with a benchmark client which starts requesting voxel blocks at a fixed rate of 100 Hz when the reconstruction process starts. Furthermore, the reconstruction process uses a fixed reconstruction speed of 30 Hz, matching the frame rate of the used datasets. We set the voxel size to 5 mm and the truncation region to 60 mm, and used the same hash map/set sizes and GPU and CPU voxel block pool sizes as in previous work [27].
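For illustration, a minimal sketch of such a benchmark client's request loop is given below; request_blocks stands in for the actual network request of the streaming protocol and is not part of the original system.

```python
import time

def run_benchmark_client(request_blocks, request_rate_hz=100, package_size=512):
    """Request voxel block packages from the server at a fixed rate and discard
    the payload, as done by the benchmark clients in the experiments.

    request_blocks(package_size) is a placeholder for the actual network request
    and is expected to return a list of voxel blocks, or None when streaming ends."""
    period = 1.0 / request_rate_hz
    received = 0
    while True:
        start = time.monotonic()
        package = request_blocks(package_size)
        if package is None:          # server signals that no further data will arrive
            break
        received += len(package)     # payload is discarded, only counted
        # Sleep for the remainder of the request interval to keep the fixed rate.
        time.sleep(max(0.0, period - (time.monotonic() - start)))
    return received
```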

5.1 System Scalability

Approach | Dataset | Max. ECs | Request Rate [Hz] | Model Size [# MC Voxel Blocks]
B | lounge | 5 | 100 | 314
B | copyroom | 9 | 50 | 228
B | heating_room | 3 | 100 | 850
B | pool | 5 | 100 | 590
B | lr kt2 | 1 | 200 | 834
B+DDF | lounge | 9 | 50 | 270
B+DDF | copyroom | 10 | 50 | 230
B+DDF | heating_room | 9 | 50 | 443
B+DDF | pool | 10 | 50 | 379
B+DDF | lr kt2 | 10 | 50 | 227
B+VBAD | lounge | 8 | 50 | 264
B+VBAD | copyroom | 5 | 100 | 226
B+VBAD | heating_room | 4 | 100 | 622
B+VBAD | pool | 8 | 50 | 446
B+VBAD | lr kt2 | 1 | 200 | 550
B+MCVBP | lounge | 21 | 12 | 47 / 51 (314)
B+MCVBP | copyroom | 9 | 50 | 65 / 75 (298)
B+MCVBP | heating_room | 8 | 25 | 120 / 127 (850)
B+MCVBP | pool | 13 | 25 | 104 / 108 (590)
B+MCVBP | lr kt2 | 7 | 25 | 64 / 64 (834)
Ours | lounge | 25 | 12 | 44 / 47 (240)
Ours | copyroom | 18 | 25 | 57 / 67 (202)
Ours | heating_room | 27 | 12 | 90 / 94 (352)
Ours | pool | 28 | 12 | 95 / 99 (317)
Ours | lr kt2 | 26 | 12 | 53 / 55 (201)
Table 1: Maximum number of exploration clients (ECs) that the server can handle without any delay compared to a single client. Instead of using different package sizes with a fixed request rate of 100 Hz, we use a fixed size of 512 and vary the rate accordingly to demonstrate the highest possible scalability. If empty MC voxel block pruning is used (B+MCVBP and Ours), the sizes of the TSDF and MC voxel block models differ and we list both (MC model size / MC update set size, with the TSDF model size in parentheses).
Figure 3: Streaming progress and latency between server (S) and exploration client (EC) over time for the heating_room dataset using our full system. Left: Absolute model sizes for the highest and lowest chosen package size. Right: Relative size differences between S and EC (w.r.t. the server MC model size and the server MC update set size).
Figure 4: Streaming progress and latency between server (S) and exploration client (EC) over time for the heating_room dataset for each system variant. Left: Absolute model sizes. Right: Relative size differences between S and EC.

In this section, we evaluate the scalability of our system in comparison to the baseline SLAMCast approach (see Table 1). In contrast to the following evaluations, the benchmark client discards the received data, which allows running all benchmark clients on a single computer without overhead. Furthermore, rather than lowering the package size, we used a fixed package size of 512 voxel blocks and lowered the request rate accordingly. This significantly reduces the constant overheads of kernel calls and memory copies and introduces only a minimal delay in the range of milliseconds, which makes it the preferred setting for handling a large number of clients. For an appropriate choice of the streaming rate, we determined the lowest request rate which still allows the benchmark client to retrieve the whole model with an acceptable delay of at most one second (see supplemental material for a detailed analysis). Then, we measured the maximum number of benchmark clients that the server could handle without introducing a further delay. While the original SLAMCast system was only able to handle around 3-5 clients in general, both filters at the reconstruction side (B+DDF and B+VBAD) raised this limit to up to 10 clients. Since there is a tracking loss at the end of the copyroom sequence resulting in a slightly higher delay, a higher request rate was chosen for this dataset and the scalability decreased accordingly. Although the number of MC voxel blocks is significantly lower after pruning (B+MCVBP), we observed that the general performance is similar to the depth discontinuity filter approach. Here, the TSDF voxel block model has the same size as in the base approach and is, hence, considerably larger than in the other approaches. In contrast, our full system reduces the request rate requirements to 12 Hz for most scenes, making it the preferred choice for this parameter. This significantly improves the scalability to more than 24 clients in all scenes, which is sufficient for applications in education, entertainment or collaboration scenarios.
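For reference, the per-client streaming load implied by these settings follows directly from the product of request rate and package size, e.g.

$$12~\text{requests/s} \times 512~\text{blocks/request} = 6144~\text{voxel blocks/s per client},$$

compared to $100 \times 512 = 51200$ blocks/s at a 100 Hz request rate, i.e. roughly an eight-fold reduction in per-client load.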

5.2 Latency and Streaming Progress Analysis

(a) heating_room: B, 16.5 ms (8.6 ms), 3482 MB
(b) heating_room: B+DDF, 10.3 ms (4.4 ms), 1815 MB
(c) heating_room: B+VBAD, 13.2 ms (6.6 ms), 2548 MB
(d) heating_room: B+MCVBP, 16.1 ms (8.7 ms), 3482 MB
(e) heating_room: Ours, 9.7 ms (3.3 ms), 1442 MB
(f) lounge: B, 10.9 ms (5.0 ms), 1286 MB
(g) lounge: B+DDF, 10.0 ms (4.4 ms), 1106 MB
(h) lounge: B+VBAD, 10.0 ms (5.4 ms), 1081 MB
(i) lounge: B+MCVBP, 10.9 ms (5.2 ms), 1286 MB
(j) lounge: Ours, 9.7 ms (4.3 ms), 983 MB
Figure 5: Comparison of visual quality, mean runtime (and standard deviation) as well as memory requirements for each system variant. All individual contributions reduced the amount of reconstruction artifacts while improving the overall reconstruction performance.

In addition to the scalability analysis, we also measured the streaming latency over time (see Figure 3). Similar to the original SLAMCast approach, our system has a small delay between the reconstruction process and the server process due to the shared streaming strategy. However, our optimized server model prunes unreliable or irrelevant blocks, which results in a very low latency between the server and the exploration client. We also compared the latency between the largest and smallest chosen package size, i.e. 1024 and 64 blocks per request. Here, the model size of the exploration client is close to the size of the server's update set, indicating very fast and low-latency streaming, while the gap to the minimal size of the server model increases over time. Note that these two sizes are the bounds for the exploration client's model size, and clients which have reconnected, e.g. due to network outages, will receive a slightly more compact model closer to the lower bound. Reducing the package size from 1024 to 64 blocks significantly reduces the bandwidth requirements (see supplemental material for a detailed analysis) and leads to a slightly worse latency when the reconstruction process queues the currently visible voxel blocks for streaming.

In Figure 4, we also compare the different system variants regarding streaming progress and latency. For a fair comparison between the approaches, the package size is chosen such that the mean bandwidths are similar, i.e. around 15 Mbit/s. Here, we also consider the size of the update set in addition to the size of the server model when empty MC voxel block pruning is enabled (B+MCVBP and Ours). In these scenarios, the number of voxel blocks transmitted to the exploration client, which is bounded by these two sizes, is typically close to the upper bound. In comparison to the baseline, both filtering approaches at the reconstruction side (B+DDF and B+VBAD) reduce the latency significantly. Similar results can be seen when empty MC voxel blocks are pruned (B+MCVBP). Whereas all of these approaches still introduce a noticeable delay around the time steps 40 s and 90-100 s, our full system is capable of streaming the reconstructed model with almost no delay across the whole sequence. Additional results regarding bandwidth and streaming latency over time are provided in the supplemental material.

5.3 Visual Quality

In order to demonstrate the benefit for standalone volumetric 3D reconstruction, we also provide a qualitative comparison regarding the visual quality of the reconstructed 3D models as well as the respective runtime and memory requirements for the individual system variants (see Figure 5). In general, all approaches generated detailed and accurate 3D models from the noisy RGB-D input data. However, without filtering, there might be some artifacts around depth discontinuities as well as in regions which have not been fully observed by the camera. These artifacts affect the overall visual experience and lead to high runtime and memory requirements. Using virtual downsampling at the voxel block allocation stage (B+VBAD), we obtain almost identical 3D models but the computational burden is significantly lower since the number of empty blocks within the model is reduced. In contrast, filtering depth samples at depth discontinuities (B+DDF) or unreliable triangle data during Marching Cubes (B+MCVBP) reduces the amount of artifacts in the aforementioned regions. Note that in standalone 3D reconstruction, voxel block pruning (B+MCVBP) mainly affects the triangulation step at the end of the capturing session which leads to results similar to the base approach regarding runtime and memory. Our full system enhances the visual quality even further and almost completely removes artifacts without sacrificing the overall model completeness. Here, we observe improvements of 10-40% and 25-60% for the runtime and memory footprint respectively depending on the scene. The objects in the lounge scene have been captured at a much smaller distance and from more angles than in the heating_room scene which leads to less unreliable input data and, hence, a lower impact of our outlier filtering approach. Additional performance measurements and results are provided in the supplemental material. In the context of live remote collaboration, a slightly less complete model can be beneficial and helps to identify regions that still need to be captured and reliably reconstructed. This, in turn, might even increase the model completeness and accuracy since the scene is more thoroughly acquired by the user.

5.4 Limitations

Despite the significant improvements in terms of scalability, latency and visual quality, our system still has some limitations. Since our work is based on the SLAMCast system, misalignments within the reconstruction might occur due to fast camera movement. While this problem has been addressed by loop-closure techniques [2, 11], their integration into live telepresence systems is still highly challenging. Furthermore, overly aggressive virtual downsampling during voxel block allocation might lead to holes in the final model when blocks covering distant objects are always skipped and, hence, never allocated. However, this is only problematic for long-range devices, whereas typical RGB-D cameras have a smaller range of up to 5 meters, which is still sufficient for most scenarios.

6 Conclusion

We presented a highly scalable multi-client live telepresence system which allows immersing a large number of people into a live-captured environment. For this purpose, we used well-established systems and proposed several optimizations regarding scalability, latency, and visual quality. While our contributions are designed with the telepresence system in mind, we also show their application to standalone volumetric 3D reconstruction approaches. As demonstrated in a comprehensive evaluation, our novel system allows the immersion of more than 24 people within the same scene using consumer hardware.

Acknowledgements.
This work was supported by the DFG projects KL 1142/11-1 (DFG Research Unit FOR 2535 Anticipating Human Behavior) and KL 1142/9-2 (DFG Research Unit FOR 1505 Mapping on Demand).

References

  • [1] J. Chen, D. Bautembach, and S. Izadi. Scalable Real-time Volumetric Surface Reconstruction. ACM Trans. Graph., 32:113:1–113:16, 2013.
  • [2] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt. BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Reintegration. ACM Trans. Graph., 36(3):24, 2017.
  • [3] M. Dou et al. Fusion4D: Real-time Performance Capture of Challenging Scenes. ACM Trans. Graph., 35(4):114:1–114:13, 2016.
  • [4] A. J. Fairchild, S. P. Campion, A. S. García, R. Wolff, T. Fernando, and D. J. Roberts. A Mixed Reality Telepresence System for Collaborative Space Operation. IEEE Trans. on Circuits and Systems for Video Technology, 27(4):814–827, 2016.
  • [5] H. Fuchs, A. State, and J. Bazin. Immersive 3D Telepresence. Computer, 47(7):46–52, 2014.
  • [6] S. Golodetz, T. Cavallari, N. A. Lord, V. A. Prisacariu, D. W. Murray, and P. H. S. Torr. Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Optimisation. IEEE Trans. on Visualization and Computer Graphics, 24(11):2895–2905, Nov 2018.
  • [7] A. Handa, T. Whelan, J. McDonald, and A. J. Davison. A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM. In Proc. of the Int. Conf. on Robotics and Automation, pp. 1524–1531, 2014.
  • [8] P. Henry, D. Fox, A. Bhowmik, and R. Mongia. Patch Volumes: Segmentation-Based Consistent Mapping with RGB-D Cameras. In Int. Conf. on 3D Vision, 2013.
  • [9] S. Izadi et al. KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera. In Proc. of the ACM Symp. on User Interface Software and Technology, pp. 559–568, 2011.
  • [10] B. Jones et al. RoomAlive: Magical Experiences Enabled by Scalable, Adaptive Projector-camera Units. In Proc. of the Annual Symp. on User Interface Software and Technology, pp. 637–644, 2014.
  • [11] O. Kähler, V. A. Prisacariu, and D. W. Murray. Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure. In European Conference on Computer Vision, pp. 500–516, 2016.
  • [12] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray. Very High Frame Rate Volumetric Integration of Depth Images on Mobile Devices. IEEE Trans. on Visualization and Computer Graphics, 21(11):1241–1250, 2015.
  • [13] O. Kähler, V. A. Prisacariu, J. P. C. Valentin, and D. W. Murray. Hierarchical Voxel Block Hashing for Efficient Integration of Depth Images. IEEE Robotics and Automation Letters, 1(1):192–197, 2016.
  • [14] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb. Real-Time 3D Reconstruction in Dynamic Scenes Using Point-Based Fusion. In Proc. of Joint 3DIM/3DPVT Conference, p. 8, 2013.
  • [15] W. E. Lorensen and H. E. Cline. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. In Proc. of the 14th Annual Conf. on Computer Graphics and Interactive Techniques, pp. 163–169, 1987.
  • [16] R. Maier, R. Schaller, and D. Cremers. Efficient Online Surface Correction for Real-time Large-Scale 3D Reconstruction. In British Machine Vision Conference (BMVC), 2017.
  • [17] A. Maimone, J. Bidwell, K. Peng, and H. Fuchs. Enhanced personal autostereoscopic telepresence system using commodity depth cameras. Computers & Graphics, 36(7):791 – 807, 2012.
  • [18] A. Maimone and H. Fuchs. Real-time volumetric 3D capture of room-sized scenes for telepresence. In Proc. of the 3DTV-Conference, 2012.
  • [19] D. Molyneaux, S. Izadi, D. Kim, O. Hilliges, S. Hodges, X. Cao, A. Butler, and H. Gellersen. Interactive Environment-Aware Handheld Projectors for Pervasive Computing Spaces. In Proc. of the Int. Conf. on Pervasive Computing, pp. 197–215, 2012.
  • [20] A. Mossel and M. Kröter. Streaming and exploration of dynamically changing dense 3d reconstructions in immersive virtual reality. In Proc. of IEEE Int. Symp. on Mixed and Augmented Reality, pp. 43–48, 2016.
  • [21] R. A. Newcombe et al. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In Proc. of IEEE Int. Symp. on Mixed and Augmented Reality. IEEE, 2011.
  • [22] R. A. Newcombe, D. Fox, and S. M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 343–352, 2015.
  • [23] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D Reconstruction at Scale Using Voxel Hashing. ACM Trans. Graph., 32(6):169:1–169:11, 2013.
  • [24] S. Orts-Escolano et al. Holoportation: Virtual 3D Teleportation in Real-time. In Proc. of the Annual Symp. on User Interface Software and Technology, pp. 741–754, 2016.
  • [25] F. Reichl, J. Weiss, and R. Westermann. Memory-Efficient Interactive Online Reconstruction From Depth Image Streams. Computer Graphics Forum, 35(8):108–119, 2016.
  • [26] H. Roth and M. Vona. Moving volume kinectfusion. In Proc. of the British Machine Vision Conference, pp. 112.1–112.11, 2012.
  • [27] P. Stotko, S. Krumpen, M. B. Hullin, M. Weinmann, and R. Klein. SLAMCast: Large-Scale, Real-Time 3D Reconstruction and Streaming for Immersive Multi-Client Live Telepresence. IEEE Trans. on Visualization and Computer Graphics, 25(5):2102–2112, 2019.
  • [28] R. Vasudevan, G. Kurillo, E. Lobaton, T. Bernardin, O. Kreylos, R. Bajcsy, and K. Nahrstedt. High-Quality Visualization for Geographically Distributed 3-D Teleimmersive Applications. IEEE Trans. on Multimedia, 13(3):573–584, 2011.
  • [29] T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald. Robust Real-Time Visual Odometry for Dense RGB-D Mapping. In IEEE Int. Conf. on Robotics and Automation, pp. 5724–5731, 2013.
  • [30] T. Whelan, M. Kaess, M. Fallon, H. Johannsson, J. Leonard, and J. McDonald. Kintinuous: Spatially Extended KinectFusion. In RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012.
  • [31] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The Int. Journal of Robotics Research, 34(4-5):598–626, 2015.
  • [32] Q.-Y. Zhou and V. Koltun. Dense Scene Reconstruction with Points of Interest. ACM Trans. Graph., 32(4):112, 2013.