Capturing Dynamic Textured Surfaces of Moving Targets

04/11/2016 ∙ by Ruizhe Wang, et al. ∙ Adobe ∙ University of Southern California ∙ The University of Texas at Austin ∙ Toyota Technological Institute at Chicago

We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.




1 Introduction

The rekindling of interest in immersive, 360-degree virtual environments, spurred on by the Oculus, Hololens, and other breakthroughs in consumer AR and VR hardware, has birthed a need for digitizing objects with full geometry and texture from all views. Among the most important objects to digitize in this way are moving, clothed humans, yet they are also among the most challenging: the human body can undergo large deformations over short time spans, has complex geometry with occluded regions that can only be seen from a small number of angles, and has regions like the face with important high-frequency features that must be faithfully preserved.

Most techniques for capturing high-quality digital humans rely on a large array of sensors mounted around a fixed capture volume. The recent work of Collet et al. [1] uses such a setup to capture live performances and compresses them to enable streaming of free-viewpoint videos. Unfortunately, these techniques are severely restrictive: first, to ensure high-quality reconstruction and sufficient coverage, a large number of expensive sensors must be used, leaving human capture out of reach of consumers without the resources of a professional studio. Second, the subject must remain within the small working volume enclosed by the sensors, ruling out subjects interacting with large, open environments or undergoing large motions.

Using free-viewpoint sensors is an attractive alternative, since it does not constrain the capture volume and allows ordinary consumers, with access to only portable, low-cost devices, to capture human motion. The typical challenge with using hand-held active sensors is that multiple sensors must be used simultaneously from different angles to achieve adequate coverage of the subject. In overlapping regions, signal interference causes significant deterioration in the quality of the captured geometry. This problem can be avoided by minimizing the amount of overlap between sensors, but existing registration algorithms for aligning the captured partial scans only work reliably if the partial scans significantly overlap. Template-based methods like the work of Ye et al. [2] circumvent these difficulties by warping a full geometric template to track the moving sparse partial scans, but templates are only readily available for naked humans [3]; for clothed humans a template must be precomputed on a case-by-case basis.

We thus introduce a new shape registration method that can reliably register partial scans even with almost no overlap, sidestepping the need for shape templates or sensor arrays. This method is based on a visibility error metric which encodes the intuition that if a set of partial scans is properly registered, each partial scan, when viewed from the same angle at which it was captured, should occlude all other partial scans. We solve the global registration problem by minimizing this error metric using a particle swarm strategy, to ensure sufficient coverage of the solution space to avoid local minima. This registration method significantly outperforms state-of-the-art global registration techniques like 4PCS [4] for challenging cases of small overlap.


We present the first end-to-end free-viewpoint reconstruction framework that produces watertight, fully-textured surfaces of moving, clothed humans using only three to four handheld depth sensors, without the need of shape templates or extensive calibration. The most significant technical component of this system is a robust pairwise global registration algorithm, based on minimizing a visibility error metric, that can align depth maps even in the presence of very little (15%) overlap.

2 Related Work

Digitizing realistic, moving characters has traditionally involved an intricate pipeline including modeling, rigging, and animation. This process has been occasionally assisted by 3D motion and geometry capture systems such as marker-based motion capture or markerless capture methods involving large arrays of sensors [5]. Both approaches supply artists with accurate reference geometry and motion, but they require specialized hardware and a controlled studio setting.

Real-time 3D scanning and reconstruction systems requiring only a single sensor, like KinectFusion [6], allow casual users to easily scan everyday objects; however, as with most simultaneous localization and mapping (SLAM) techniques, the major assumption is that the scanned scene is rigid. This assumption is invalid for humans, even for humans attempting to maintain a single pose; several follow-up works have addressed this limitation by allowing near-rigid motion, and using non-rigid partial scan alignment algorithms [7, 8]. While the recent DynamicFusion framework [9] and similar systems [10] show impressive results in capturing non-rigidly deforming scenes, our goal of capturing and tracking freely moving targets is fundamentally different: we seek to reconstruct a complete model of the moving target at all times, which requires either extensive prior knowledge of the subject’s geometry, or the use of multiple sensors to provide better coverage.

Prior work has proposed various simplifying assumptions to make the problem of capturing entire shapes in motion tractable. Examples include assuming availability of a template, high-quality data, smooth motion, and a controlled capture environment.

Template-based Tracking:

The vast majority of related work on capturing dynamic motion focuses on specific human parts, such as faces [11] and hands [12, 13], for which specialized shapes and motion templates are available. In the case of tracking the full human body, parameterized body models [14] have been used. However, such models work best on naked subjects or subjects wearing very tight clothing, and are difficult to adapt to moving people wearing more typical garments.

Another category of methods first captures a template in a static pose and then tracks it across time. Vlasic et al. [15] use a rigged template model, and De Aguiar et al. [16] apply a skeleton-less shape deformation model to the template to track human performances from multi-view video data. Other methods [17, 18] use a smoothed template to track motion from a capture sequence. The more recent works of Wu et al. [19] and Liu et al. [20] track both the surface and the skeleton of a template from stereo cameras and a sparse set of depth sensors, respectively.

All of these template-based approaches handle with ease the problem of tracking moving targets, since the entire geometry of the target is known. However, in addition to requiring constructing or fitting said template, these methods share the common limitation that they cannot handle geometry or topology changes which are likely to happen during typical human motion (picking up an object, crossing arms, etc.).

Dynamic Shape Capture:

Several works have proposed to reconstruct both shape and motion from a dynamic motion sequence. Given a series of time-varying point clouds, Wand et al. [21] use a uniform deformation model to capture both geometry and motion. A follow-up work [22] proposes to separate the deformation models used for geometry and motion capture. Both methods make the strong assumption that the motion is smooth, and thus suffer from popping artifacts in the case of large motions between time steps. Süßmuth et al. [23] fit a 4D space-time surface to the given sequence but they assume that the complete shape is visible in the first frame. Finally, Tevs et al. [24] detect landmark correspondences which are then extended to dense correspondences. While this method can handle a considerable amount of topological change, it is sensitive to large acquisition holes, which are typical for commercial depth sensors.

Another category of related work aims to reconstruct a deforming watertight mesh from a dynamic capture sequence by imposing either visual hull [25] or temporal coherency constraints [26]. Such constraints either limit the capture volume or are not sufficient to handle large holes. Furthermore, neither of these methods focuses on propagating texture to invisible areas; in contrast, we use dense correspondences to perform texture inpainting in non-visible regions. Bojsen-Hansen et al. [27] also use dense correspondences to track surfaces with evolving topologies. However, their method requires the input to be a closed manifold surface. Our goal, on the other hand, is to reconstruct such complete meshes from sparse partial scans.

The recent work of Collet et al. [1] uses multimodal input data from a stage setup to capture topologically-varying scenes. While this method produces impressive results, it requires a pre-calibrated complex setup. In contrast, we use a significantly cheaper and more convenient setup composed of three to four commercial depth sensors.

Global Range Image Registration:

At the heart of our approach is a robust algorithm that registers noisy data coming from each commercial depth sensor with very little overlap. A typical approach is to first perform global registration to compute an approximate rigid transformation between a pair of range images, which is then used to initialize local registration methods (e.g., Iterative Closest Point (ICP) [28, 29]) for further refinement. A popular approach for global registration is to construct feature descriptors for a set of interest points which are then correlated to estimate a rigid transformation. Spin-images [30], integral volume descriptors [31], and point feature histograms (PFH, FPFH) [32, 33] are among the popular descriptors proposed by prior work. Makadia et al. [34] represent each range image as a translation-invariant extended Gaussian image (EGI) [35] using surface normals. They first compute the optimum rotation by correlating two EGIs and further estimate the corresponding translation using the Fourier transform. For noisy data such as that coming from a commercial depth sensor, however, it is challenging to compute reliable feature descriptors. Another approach for global registration is to align either main axes extracted by principal component analysis (PCA) [36] or a sparse set of control points in a RANSAC loop [37]. Silva et al. [38] introduce a robust surface interpenetration measure (SIM) and search the 6 DoF parameter space with a genetic algorithm. More recently, Yang et al. [39] adopt a branch-and-bound strategy to extend the basic ICP algorithm in a global manner. 4PCS [4] and its latest variant Super-4PCS [40] register a pair of range images by extracting all coplanar 4-point sets. Such approaches, however, are likely to converge to wrong alignments in cases of very little overlap between the range images (see Section 5).

Several prior works have adopted silhouette-based constraints for aligning multiple images [41, 42, 43, 44, 45, 46, 47]. While the idea is similar to our approach, our registration algorithm also takes advantage of depth information, and employs a particle-swarm optimization strategy that efficiently explores the space of alignments.

3 System Overview

Figure 1: An overview of our textured dynamic surface capturing system.

Our pipeline for reconstructing fully-textured, watertight meshes from three to four depth sensors can be decomposed into four major steps. See Figure 1 for an overview of how these steps interrelate.

1. Data Capture: We capture the subject (who is free to move arbitrarily) using uncalibrated hand-held real-time RGBD sensors. We experimented with both Kinect One time-of-flight cameras mounted on laptops, and Occipital Structure IO sensors mounted on iPad Air 2 tablets (Section 6).

2. Global Rigid Registration: The relative positions of the depth sensors constantly change over time, and the captured depth maps often have little overlap (10%-30%). For each frame, we globally register sparse depth images from all views (Section 4). This step produces registered, but incomplete, textured partial scans of the subject for each frame.

3. Surface Reconstruction: To reduce flickering artifacts, we adopt the shape completion pipeline of Li et al. [26] to warp partial scans from temporally-proximate frames to the current frame geometry. A weighted Poisson reconstruction step then extracts a single watertight surface. There is no guarantee, however, that the resulting fused surface has complete texture coverage (and indeed texture will typically be missing at partial scan seams and in occluded regions).

4. Dense Correspondences for Texture Reconstruction: We complete regions of missing or unreliable texture on one frame by propagating data from other (perhaps very temporally-distant) frames with reliable texture in that region. We adopt a recently-proposed correspondence computation framework [48] based on a deep neural network to build dense correspondences between any two frames, even if the subject has undergone large relative deformations. Upon building dense correspondences, we transfer texture from reliable regions to less reliable ones.

We next describe the details of the global registration method as it constitutes the core of our pipeline. Please refer to the supplementary material for more details of the other components.

4 Robust Rigid Registration

The key technical challenge in our pipeline is registering a set of depth images accurately without assuming any initialization, even when the geometry visible in each depth image has very little overlap with any other depth image. We attack this problem by developing a robust pairwise global registration method: let $P_1$ and $P_2$ be partial meshes generated from two depth images captured simultaneously. We seek a global Euclidean transformation $T_{12}$ which aligns $P_2$ to $P_1$. Traditional pairwise registration based on finding corresponding points on $P_1$ and $P_2$, and minimizing the distance between them, has notorious difficulty in this setting. As such we propose a novel visibility error metric (VEM) (Section 4.1), and we minimize the VEM to find $T_{12}$ (Section 4.2). We further extend this pairwise method to handle multi-view global registration (Section 4.3).

4.1 Visibility Error Metric

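Figure 2: Left: two partial scans $P_1$ (dotted) and $P_2$ (solid) of a 2D bunny. Middle: when viewed from $P_1$’s camera, $P_2$ is entirely occluded (blue). Therefore all of $P_2$ is in the occluded region $O$. Right: when viewed from $P_2$’s camera, parts of $P_1$ are in the occluded region $O$ (blue), parts occlude $P_2$ and are thus in the in-front region $F$ (yellow), and parts are in the region $B$ of points whose viewing ray misses the other scan (red).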

Suppose $P_1$ and $P_2$ are correctly aligned, and consider looking at the pair of scans through a camera whose position and orientation matches that of the sensor used to capture $P_1$. The only parts of $P_2$ that should be visible from this view are those that overlap with $P_1$: parts of $P_2$ that do not overlap should be completely occluded by $P_1$ (otherwise they would have been detected and included in $P_1$). Similarly, when looking at the scene through the camera that captured $P_2$, only parts of $P_1$ that overlap with $P_2$ should be visible.

Visibility-Based Alignment Error We now formalize the above idea. Let $P_1, P_2$ be two partial scans, with $P_1$ captured using a sensor at position $c_p$ and view direction $c_v$. For every point $x \in P_2$, let $I(x)$ be the first intersection point of $P_1$ and the ray from $c_p$ through $x$. We can partition $P_2$ into three regions, and associate to each region an energy density measuring the extent to which points in that region violate the above visibility criteria:

  • points $x \in O$ that are occluded by $P_1$: $\|x - c_p\| \ge \|I(x) - c_p\|$. To points in this region we associate no energy: $d_O(x) = 0$.

  • points $x \in F$ that are in front of $P_1$: $\|x - c_p\| < \|I(x) - c_p\|$. Such points might exist even when $P_1$ and $P_2$ are well-aligned, due to surface noise and roughness, etc. However, we penalize large violations using: $d_F(x) = \|x - I(x)\|^2$.

  • points $x \in B$ for which $I(x)$ does not exist. Such points also violate the visibility criteria. It is tempting to penalize such points proportionally to the distance between $x$ and its closest point on $P_1$, but a small misalignment could create a point in $B$ that is very distant from $P_1$ in Euclidean space, despite being very close to $P_1$ on the camera image plane. We therefore penalize $x$ using squared distance on the image plane,

    $d_B(x) = \min_{y \in P_1} \|\Pi_{c_v} x - \Pi_{c_v} y\|^2$,

    where $\Pi_{c_v}$ is the projection onto the plane orthogonal to $c_v$.

Figure 2 illustrates these regions on a didactic 2D example. Alignment of $P_1$ and $P_2$ from the point of view of $P_1$ is then measured by the aggregate energy $d(P_1, P_2) = \sum_{x \in O} d_O(x) + \sum_{x \in F} d_F(x) + \sum_{x \in B} d_B(x)$. Finally, every Euclidean transformation $T_{12}$ that produces a possible alignment between $P_1$ and $P_2$ can be associated with an energy to define our visibility error metric on $SE(3)$,

    $E(T_{12}) = d(P_1, T_{12} P_2) + d(P_2, T_{12}^{-1} P_1)$.    (1)
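The three-region energy can be illustrated with a small sketch. It is not the authors' implementation: it assumes an orthographic sensor looking along +z, represents the first scan as a depth image (NaN where nothing was captured), and the name `visibility_energy` is hypothetical. Under these assumptions the viewing ray through a point is its pixel column, so the three cases reduce to simple depth comparisons.

```python
import numpy as np

def visibility_energy(depth1, points2):
    """One-sided visibility energy of scan 2 seen from the sensor of scan 1.

    Assumptions (not from the paper): orthographic sensor looking along +z,
    depth1 is an (H, W) depth image of scan 1 with NaN for empty pixels,
    pixel (row, col) sits at image-plane position (x=col, y=row).
    """
    h, w = depth1.shape
    rows, cols = np.nonzero(~np.isnan(depth1))
    covered = np.stack([cols, rows], axis=1).astype(float)  # pixels hit by scan 1
    total = 0.0
    for x, y, z in points2:
        c, r = int(np.rint(x)), int(np.rint(y))
        inside = 0 <= r < h and 0 <= c < w
        if inside and not np.isnan(depth1[r, c]):
            gap = depth1[r, c] - z
            if gap > 0.0:
                # Point lies in front of scan 1: squared distance to the hit point.
                total += gap ** 2
            # Otherwise the point is occluded by scan 1 and contributes nothing.
        else:
            # The ray misses scan 1: squared image-plane distance to scan 1.
            total += np.min(np.sum((covered - (x, y)) ** 2, axis=1))
    return float(total)
```

A symmetric metric in the spirit of the text would add the same quantity computed with the roles of the two scans swapped, after applying the candidate transformation.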


4.2 Finding the Transformation

Figure 3: (a) Left: a pair of range images to be registered. Right: VEM evaluated on the entire rotation space. Each point within the unit ball represents the vector part of a unit quaternion; for each quaternion, we estimate its corresponding translation component and evaluate the VEM on the composite transformation. The red rectangles indicate areas with local minima, and the red cross is the global minimum. (b) Example particle locations and displacements at two different iterations. Blue vectors indicate displacement of regular (non-guide) particles following a traditional particle swarm scheme. Red vectors are displacements of guide particles. Guide particles draw neighboring regular particles more efficiently towards local minima to search for the global minimum.

Minimizing the error metric (1) consists of solving a nonlinear least squares problem and so in principle can be optimized using e.g. the Gauss-Newton method. However, it is non-convex, and prone to local minima (Figure 3). Absent a straightforward heuristic for picking a good initial guess, we instead adopt a Particle Swarm Optimization (PSO) [49] method to efficiently minimize (1), where “particles” are candidate rigid transformations that move towards smaller energy landscapes in $SE(3)$. We could independently minimize $E$ starting from each particle as an initial guess, but this strategy is not computationally tractable. So we iteratively update all particle positions in lockstep: a small set of the most promising guide particles, which are most likely to be close to the global minimum, are updated using an iteration of Levenberg-Marquardt. The rest of the particles receive PSO-style weighted random perturbations. This procedure is summarized in Algorithm 1, and each step is described in more detail below.

Algorithm 1: Modified Particle Swarm Optimization
Input: A set of initial “particles” (orientations)
evaluate VEM on initial particles
for each iteration do
     select guide particles
     for each guide particle do
         update guide particle using Levenberg-Marquardt
     end for
     for each regular particle do
         update particle using weighted random displacement
     end for
     recalculate VEM at new locations
end for
Output: The best particle
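The control flow of Algorithm 1 can be sketched on a toy problem. Everything below is an illustrative assumption, not the paper's code: particles live in the plane instead of $SE(3)$, the toy `energy` stands in for the VEM, a fixed-step gradient update stands in for the Levenberg-Marquardt iteration, and the function names and parameter values are hypothetical.

```python
import numpy as np

def energy(p):
    # Toy non-convex energy with many local minima (stand-in for the VEM).
    return np.sum(p**2, axis=-1) + 2.0 * np.sum(1.0 - np.cos(3.0 * p), axis=-1)

def grad(p):
    return 2.0 * p + 6.0 * np.sin(3.0 * p)

def assign_leaders(pos, e, radius):
    """Greedy guide selection: leader[i] == i marks particle i as a guide."""
    leader = np.full(len(e), -1)
    for i in np.argsort(e):
        if leader[i] < 0:  # lowest-energy particle not yet claimed becomes a guide
            near = (np.linalg.norm(pos - pos[i], axis=1) <= radius) & (leader < 0)
            leader[near] = i
    return leader

def modified_pso(n=200, iters=30, radius=0.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-4.0, 4.0, size=(n, 2))
    vel = np.zeros_like(pos)
    e = energy(pos)
    best_pos, best_e = pos.copy(), e.copy()  # per-particle best so far
    for _ in range(iters):
        leader = assign_leaders(pos, e, radius)
        guide = leader == np.arange(n)
        new_pos = pos.copy()
        # Guide particles: deterministic descent step (step 1/L with L = 20,
        # the largest curvature of the toy energy, so the energy cannot rise).
        new_pos[guide] = pos[guide] - 0.05 * grad(pos[guide])
        # Regular particles: randomly weighted sum of previous displacement,
        # pull towards own best position, and pull towards the local guide.
        r1, r2, r3 = rng.random((3, n, 1))
        vel = 0.5 * r1 * vel + r2 * (best_pos - pos) + r3 * (pos[leader] - pos)
        new_pos[~guide] = (pos + vel)[~guide]
        vel[guide] = new_pos[guide] - pos[guide]
        pos = new_pos
        e = energy(pos)
        improved = e < best_e
        best_pos[improved], best_e[improved] = pos[improved], e[improved]
    k = int(np.argmin(best_e))
    return best_pos[k], float(best_e[k])
```

Calling `modified_pso()` returns the best position found and its energy; because per-particle bests are tracked, the returned energy is never worse than the best initial sample.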

Initial Particle Sampling We begin by sampling $N$ particles, where each particle represents a rigid motion $T_i = (R_i, t_i) \in SE(3)$. Since $SE(3)$ is not compact, it is not straightforward to directly sample the initial particles. We instead uniformly sample only the rotational component $R_i$ of each particle [50], and solve for the best translation using the following Hough-transform-like procedure. For every $x \in P_1$ and $y \in P_2$, we measure the angle between their respective normals (with the normal of $y$ rotated by $R_i$), and if it is less than a fixed threshold, the pair votes for a translation of $x - R_i y$. These translations are binned and the best translation $t_i$ is extracted from the bin with the most votes. The translation estimation procedure is robust even when the amount of overlap is limited (Figure 4).

The above procedure yields a set of initial particles. We next describe how to step the orientation particles from their values $T_i^k$ at iteration $k$ to $T_i^{k+1}$ at iteration $k+1$.

Figure 4: Translation estimation examples of our Hough Transform method on range scans with limited overlap. The naïve method, which simply aligns the corresponding centroids, fails to estimate the correct translation.
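The voting scheme for the translation can be sketched as follows. This is a minimal illustration under stated assumptions: scans are plain point arrays with unit normals, every pair of points is tested, and the function name, the 20-degree normal threshold, and the bin size are hypothetical choices that are not taken from the text.

```python
from collections import Counter
import numpy as np

def vote_translation(X, NX, Y, NY, R, max_angle_deg=20.0, bin_size=0.1):
    """Estimate the translation t such that R @ y + t lands on scan X.

    X, NX : (n, 3) points and unit normals of the first scan.
    Y, NY : (m, 3) points and unit normals of the second scan.
    R     : (3, 3) candidate rotation applied to the second scan.
    """
    Yr, NYr = Y @ R.T, NY @ R.T                     # rotate points and normals
    compatible = NX @ NYr.T > np.cos(np.radians(max_angle_deg))
    votes = (X[:, None, :] - Yr[None, :, :])[compatible]  # one vote per pair
    keys = np.floor(votes / bin_size).astype(int)   # 3D bin index of each vote
    best_key, _ = Counter(map(tuple, keys)).most_common(1)[0]
    in_best = np.all(keys == np.array(best_key), axis=1)
    return votes[in_best].mean(axis=0)              # average inside winning bin
```

Testing all pairs costs O(nm) memory; subsampling the scans first would be a natural choice for real range images.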

Identifying Guide Particles We want to select as guide particles those particles with lowest visibility error metric; however we don’t want many clustered redundant guide particles. Therefore we first promote the particle with lowest error metric to guide particle, then remove from consideration all nearby particles, i.e. those whose rotation $R_i$ satisfies $d(R_i, R_g) \le r$, where $R_g$ is the rotation of the newly promoted guide particle, $r$ is a fixed radius, and $d$ is the bi-invariant metric on $SO(3)$, i.e. the least angle of all rotations $R$ with $R R_g = R_i$. We then repeat this process (promoting the remaining particle with lowest VEM, removing nearby particles, etc.) until no candidates remain.
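The greedy selection can be sketched with rotation matrices. This is an illustration, not the authors' code: the function names are hypothetical, rotations are assumed to be stored as 3x3 matrices, and the rotation distance is taken to be the geodesic angle recovered from the trace of the relative rotation.

```python
import numpy as np

def rotation_distance(Ra, Rb):
    """Angle (radians) of the relative rotation between Ra and Rb."""
    cos_theta = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def select_guides(rotations, energies, radius):
    """Promote the lowest-energy particle, drop its neighbours, repeat."""
    remaining = [int(i) for i in np.argsort(energies)]
    guides = []
    while remaining:
        g = remaining.pop(0)          # lowest energy among the survivors
        guides.append(g)
        remaining = [i for i in remaining
                     if rotation_distance(rotations[g], rotations[i]) > radius]
    return guides

def rot_z(deg):
    """Rotation about the z axis, used only to build a small example."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])
```

For example, with rotations about z by 0, 5, 90 and 95 degrees, energies 0.2, 0.1, 0.5 and 0.4, and a 30-degree radius, the particles at 5 and 95 degrees are promoted and their two neighbours are suppressed.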

Guide Particle Update We update each guide particle to decrease its VEM. We parameterize the tangent space of $SE(3)$ at $T^k = (R, t)$ by two vectors $u, v \in \mathbb{R}^3$ with $T(u, v) = (\exp([u]_\times) R,\ t + v)$, where $[u]_\times$ is the cross-product matrix. We then use the Levenberg-Marquardt method to find an energy-decreasing direction $(u, v)$, and set $T^{k+1} = T(u, v)$. Please see the supplementary material for more details.
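A tangent-space update of this kind can be sketched with the Rodrigues formula for the matrix exponential of a cross-product matrix. The exact parameterization used in the paper is not reproduced here; the sketch assumes the rotation is updated by left-multiplying with the exponential and the translation by simple addition, and all function names are hypothetical.

```python
import numpy as np

def cross_matrix(u):
    """Skew-symmetric matrix K with K @ w == np.cross(u, w)."""
    ux, uy, uz = u
    return np.array([[0.0, -uz, uy],
                     [uz, 0.0, -ux],
                     [-uy, ux, 0.0]])

def exp_so3(u):
    """Rotation by angle |u| about axis u/|u| (Rodrigues formula)."""
    theta = np.linalg.norm(u)
    K = cross_matrix(u)
    if theta < 1e-12:
        return np.eye(3) + K          # first-order approximation near zero
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def apply_update(R, t, u, v):
    """Move the rigid motion (R, t) along the tangent direction (u, v)."""
    return exp_so3(u) @ R, t + v
```

An optimizer such as Levenberg-Marquardt would supply the step `(u, v)`; the result is always a valid rotation, which is the reason for working in the tangent space.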

Other Particle Update Performing a Levenberg-Marquardt iteration on all particles is too expensive, so we move the remaining non-guide particles by applying a randomly weighted summation of each particle’s displacement during the previous iteration, the displacement towards its best past position, and the displacement towards the local best particle, i.e. the particle with lowest energy within radius $r$ (measured using $d$), as in standard PSO [49]. While the guide particles rapidly descend to local minima, they are also local best particles and drag neighboring regular particles with them for a more efficient search of all local minima, from which the global one is extracted (Figure 3). Please refer to the supplementary material for more details.

Termination Since the VEM of each guide particle is guaranteed to decrease during every iteration, the particle with lowest energy is always selected as a guide particle, and the local minima of $E$ must lie in a bounded subset of $SE(3)$, the particle with lowest energy in the above procedure is guaranteed to converge to a local minimum of $E$. We terminate the optimization once a convergence criterion on the energy is met; in practice this occurs within 5–10 iterations.

4.3 Multi-view Extension

We extend our VEM-based pairwise registration method to globally align a total of $M$ partial scans $\{P_1, \dots, P_M\}$ by estimating the optimum transformation set $\{T_{12}, \dots, T_{1M}\}$. First we perform pairwise registration between all pairs to build a registration graph, where each vertex represents a partial scan and each pair of vertices is linked by an edge carrying the estimated transformation. We then extract all spanning trees from the graph, and for each spanning tree we calculate its corresponding transformation set $\{T_{12}, \dots, T_{1M}\}$ and estimate the overall VEM as,

    $E(T_{12}, \dots, T_{1M}) = \sum_{i \ne j} d(T_{1i} P_i, T_{1j} P_j)$, with $T_{11}$ the identity.    (2)

We select the transformation set with the minimum overall VEM. We then perform several iterations of the Levenberg-Marquardt algorithm to minimize Equation 2 and jointly refine the transformation set.
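The spanning-tree step can be sketched for the small graphs that arise with three or four sensors. The sketch rests on assumptions that are not from the paper: transformations are stored as 4x4 homogeneous matrices, `pairwise[(i, j)]` is taken to map scan `j` into the frame of scan `i`, scan 0 is the reference, and the function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def spanning_trees(n, edges):
    """Yield every set of n-1 edges that connects all n vertices."""
    for subset in combinations(edges, n - 1):
        parent = list(range(n))

        def find(a):
            while parent[a] != a:
                a = parent[a]
            return a

        acyclic = True
        for i, j in subset:
            ri, rj = find(i), find(j)
            if ri == rj:              # edge would close a cycle
                acyclic = False
                break
            parent[ri] = rj
        if acyclic:                   # n-1 acyclic edges form a spanning tree
            yield subset

def absolute_transforms(n, tree, pairwise):
    """Compose pairwise transforms along a tree into scan-to-reference maps."""
    T = {0: np.eye(4)}
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for a, b in tree:
            if a == i and b not in T:
                T[b] = T[a] @ pairwise[(a, b)]
                frontier.append(b)
            elif b == i and a not in T:
                T[a] = T[b] @ np.linalg.inv(pairwise[(a, b)])
                frontier.append(a)
    return [T[k] for k in range(n)]
```

Each candidate set returned by `absolute_transforms` would then be scored with the overall energy, keeping the set with the lowest value. Brute-force enumeration is only practical for a handful of scans: a complete graph on four vertices already has 16 spanning trees.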

Temporal Coherence When globally registering depth images from multiple sensors frame by frame, we can easily incorporate temporal coherence into the global registration framework by adding the final estimated transformation set of the previous frame to the pool of transformation sets of the current frame before selecting the best one. It is worth mentioning, however, that our capturing system does not rely on the assumption of temporal coherence and the transformation set is estimated globally for each frame. This is especially crucial for a system with handheld sensors, where the temporal coherence assumption is easily violated.

5 Global Registration Evaluation

Data Sets.

We evaluate our registration algorithm on the Stanford 3D Scanning Repository and the Princeton Shape Benchmark [51]. We use 4 models from the Stanford 3D Scanning Repository (the Bunny, the Happy Buddha, the Dragon, and the Armadillo), and use all 1814 models from the Princeton Shape Benchmark. We believe these two data sets, especially the latter, are general enough to cover shape variation of real world objects. For each data set, we generated 1000 pairs of synthetic depth images with uniformly varying degrees of overlap; these range maps were synthesized using randomly-selected 3D models and randomly-selected camera angles. Each pair is then initialized with a random initial relative transformation. As such, for each pair of range images, we have the ground truth transformation as well as their overlap ratio.

Evaluation Metric.

The extracted transformation, if not correctly estimated, can be at any distance from the ground truth transformation, depending on the specific shape of the underlying surfaces and the local minima distribution of the solution space. Thus, it is not very informative to directly use the RMSE of rotation and translation estimation. It is more straightforward to use success percentage as the evaluation metric. We claim the global registration to be successful if the error of the estimated rotation is smaller than a small angle threshold. We do not enforce the translation to be close since it is scale-dependent and the translation component is easily recovered by a robust local registration method if the rotation component is close enough (e.g., by using surface normals to prune incorrect correspondences [52]).
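A common way to measure the error of an estimated rotation, given here as an assumption and not necessarily the exact definition used in this evaluation, is the angle of the relative rotation between the estimate $\hat{R}$ and the ground truth $R^{*}$. The success percentage over $K$ test pairs then counts how often this angle stays below a threshold $\theta_{\max}$ (a hypothetical symbol):

```latex
e(\hat{R}, R^{*}) = \arccos\!\left( \frac{\operatorname{tr}\!\left( \hat{R}^{\top} R^{*} \right) - 1}{2} \right),
\qquad
\text{success} = \frac{100}{K} \, \Bigl| \bigl\{ k : e(\hat{R}_k, R^{*}_k) < \theta_{\max} \bigr\} \Bigr| \; \%
```

The error is zero exactly when the two rotations coincide and reaches $\pi$ for rotations that differ by a half turn, so it is independent of the scale of the scanned object.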

Effectiveness of the PSO Strategy.

To demonstrate the advantage of the particle-swarm optimization strategy, we compare our full algorithm to three alternatives on the Stanford 3D Scanning Repository: 1) a baseline method that simply reports the minimum-energy particle among all initially-sampled particles, with no attempt at optimization; 2) using only a traditional PSO formulation, without guide particles; and 3) updating only the guide particles, and applying no displacement to ordinary particles.

Figure 5: Success percentage of the global registration method employing different optimization schemes on the Stanford 3D Scanning Repository.

Figure 5 compares the performance of the four alternatives. While updating guide particles alone achieves good registration results, incorporating the swarm intelligence further improves the performance, especially on range scans with low overlap ratios.


To demonstrate the effectiveness of the proposed registration method, we compare it against four alternatives: 1) a baseline method that aligns principal axes extracted with weighted PCA [36], where the weight of each vertex is proportional to its local surface area; 2) Go-ICP [39], which combines local ICP with a branch-and-bound search to find the global minimum; 3) FPFH [33, 53], which matches FPFH descriptors; 4) 4PCS, a state-of-the-art method that performs global registration by constructing a congruent set of 4 points between range images [4]. We do not compare with its latest variant Super-4PCS [40], as the latter improves only efficiency. For Go-ICP, FPFH and 4PCS, we use the authors’ original implementation and tune parameters to achieve optimum performance.

Figure 6: Success percentage of our global registration method compared with other methods. Left: Comparison on the Stanford 3D Scanning Repository. Right: Comparison on the Princeton Shape Benchmark.

Figure 6 compares the performance of the five methods on the two data sets respectively. The overall performance on the Princeton Shape Benchmark is lower as this data set is more challenging with many symmetric objects. As expected the baseline PCA method only works well when there is sufficient overlap. All previous methods experience a dramatic fall in accuracy once the overlap amount drops below ; 4PCS performs the best out of these, but because 4PCS is essentially searching for the most consistent area shared by two shapes, for small overlap ratio, it can converge to false alignments (Figure 11). Our method outperforms all previous approaches, and doesn’t experience degraded performance until overlap falls below . The average performance is summarized in Table 1.

                 PCA     Go-ICP   FPFH    4PCS    Ours
Stanford (%)     19.5    34.1     49.3    73.0    93.6
Princeton (%)    18.5    22.0     33.0    73.2    81.5
Runtime (sec)    0.01    25       3       10      0.5
Table 1: Performance of global registration algorithms on two data sets. Average running time is measured using a single thread on an Intel Core i7-4710MQ CPU clocked at 2.5 GHz.
Figure 7: Example registration results of range images with limited overlap. First and second row show examples from the Stanford 3D Scanning Repository and the Princeton Shape Benchmark respectively. Please see the supplementary material for more examples.

Performance on Real Data.

We further compare the performance of our registration method with 4PCS on pairs of depth maps captured from Kinect One and Structure IO sensors. The hardware setup used to obtain this data is described in detail in the next section. These depth maps share only 10%-30% overlap and 4PCS often fails to compute the correct alignment as shown in Figure 8.

Figure 8: Our registration method compared with 4PCS on real data. First two examples are captured by Kinect One sensors while the last example is captured by Structure IO sensors.


Our global registration method, like most other methods, fails to align scans with dominant symmetries since in such cases depth alone is not enough to resolve the ambiguity. This limitation holds for scans depicting large planar surfaces (e.g. walls and ground) due to continuous symmetry.

6 Dynamic Capture Results

Hardware. We provide results of our dynamic scene capture system. We experiment with two popular depth sensors, namely the Kinect One (V2) sensor and the Structure IO sensor. We mount the former on laptops and extend the capture range with long power extension cables. We attach the latter to iPad Air 2 tablets and stream data to laptops over a wireless network. Kinect One sensors stream high-fidelity 512x424 depth images and 1920x1080 color images at 30 fps. We use them to cover the entire human body from 3 or 4 views at approximately 2 meters away. Structure IO sensors stream 640x480 images for both depth and color (iPad RGB camera after compression) at 30 fps. Per-pixel depth accuracy of the Structure IO sensor is relatively low and unreliable, especially when used outdoors beyond 2 meters. Thus, we use it to capture small subjects, e.g., dogs and children, at approximately 1 meter away. Our mobile capture setting allows the subject to move freely in space instead of being restricted to a specific capture volume.

Pre-processing. For each depth image, we first remove the background by thresholding depth values and removing dominant planar segments in a RANSAC fashion. For temporal synchronization across depth sensors, we use visual cues, i.e., jumping and clapping hands, to manually initialize the starting frame. Then we automatically synchronize all remaining frames by using the system time stamp of each frame, which is accurate up to milliseconds.
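The background-removal step can be sketched as a depth threshold followed by RANSAC plane fitting. This is a minimal illustration under assumptions not stated in the text: the depth image has already been converted to a point array with depth along the z axis, only the single most dominant plane is removed, and the function name, tolerance, and iteration count are hypothetical.

```python
import numpy as np

def remove_background(points, max_depth, plane_tol=0.01, iters=200, seed=0):
    """Drop far points, then drop the dominant plane found by RANSAC.

    points    : (N, 3) array, z is the depth along the camera axis.
    max_depth : points with z >= max_depth are treated as background.
    plane_tol : distance to the plane below which a point is an inlier.
    """
    rng = np.random.default_rng(seed)
    pts = points[points[:, 2] < max_depth]            # depth thresholding
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        a, b, c = pts[rng.choice(len(pts), size=3, replace=False)]
        normal = np.cross(b - a, c - a)               # plane through 3 samples
        norm = np.linalg.norm(normal)
        if norm < 1e-12:                              # collinear sample, skip
            continue
        dist = np.abs((pts - a) @ normal) / norm      # point-to-plane distance
        inliers = dist < plane_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return pts[~best_inliers]
```

To remove several planar segments, such as a floor and a wall, the function could be applied repeatedly until the largest remaining plane has too few inliers to matter.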

Figure 9: From left to right: Globally aligned partial scans from multiple depth sensors; The water-tight mesh model after Poisson reconstruction [54]; Denoised mesh after merging neighboring meshes by using [26]; Model after our dense correspondences based texture reconstruction; Model after directly applying texture-stitcher [55].

Performance. We process data on a single thread of an Intel Core i7-4710MQ CPU clocked at 2.5 GHz. On average, it takes 15 seconds to globally align all the views of each frame, 5 minutes for surface denoising and reconstruction, and 3 minutes for building dense correspondences and texture reconstruction.

Results. We capture a variety of motions and subjects, including walking, jumping, practicing Tai Chi, and dog training (see the supplementary material for a complete list). For all captures, the performers are able to move freely in space while 3 or 4 people follow them with depth sensors. As shown in Figure 9, our geometry reconstruction method reduces the flickering artifacts of the original Poisson reconstruction, and our texture reconstruction method recovers reliable texture on occluded areas. Figure 10 provides several examples that demonstrate the effectiveness and flexibility of our capture system. Our global registration method plays a key role, as most range images share only 10% to 30% overlap. While we demonstrate successful sequences with 3 depth sensors, using an additional sensor typically improves the reconstruction quality, since it provides higher overlap between neighboring views and therefore a more robust registration.

As opposed to most existing free-form surface reconstruction techniques, our method can handle performances of subjects that move through a long trajectory instead of being constrained to a capture volume. Since our method does not require a template, it is not restricted to human performances and can successfully capture animals for which obtaining a static template would be challenging. The global registration method employed for each frame effectively reduces drift for long capture sequences. We can recover plausible textures even in regions that are not fully captured by the sensors using textures from frames where they are visible.

Figure 10: Example capture results. The sequence in the lower right corner is reconstructed from Structure IO sensors, while the other sequences are reconstructed from Kinect One sensors.

7 Conclusion

We have demonstrated that it is possible, using only a small number of synchronized consumer-grade handheld sensors, to reconstruct fully textured moving humans without restricting the subject to the constrained environment required by stage setups with calibrated sensor arrays. Our system does not require a template geometry in advance and thus generalizes well to a variety of subjects, including animals and small children. Since our system is based on low-cost devices and works in fully unconstrained environments, we believe it is an important step toward accessible creation of VR and AR content for consumers. Our results depend critically on our new alignment algorithm based on the visibility error metric, which can reliably align partial scans with much less overlap than is required by current state-of-the-art registration algorithms. Without this alignment algorithm, we would need to use many more sensors and solve the sensor interference problem that would arise. We believe this algorithm is an important contribution on its own, as it represents a significant step forward in global registration.

References


  • [1] Collet, A., Chuang, M., Sweeney, P., Gillett, D., Evseev, D., Calabrese, D., Hoppe, H., Kirk, A., Sullivan, S.: High-quality streamable free-viewpoint video. In: ACM SIGGRAPH. Volume 34., ACM (July 2015) 69:1–69:13
  • [2] Ye, G., Deng, Y., Hasler, N., Ji, X., Dai, Q., Theobalt, C.: Free-viewpoint video of human actors using multiple handheld kinects. IEEE Transactions on Cybernetics 43(5) (2013) 1370–1382
  • [3] Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: Shape completion and animation of people. ACM Trans. Graph. 24(3) (July 2005) 408–416
  • [4] Aiger, D., Mitra, N.J., Cohen-Or, D.: 4-points congruent sets for robust pairwise surface registration. In: ACM Transactions on Graphics (TOG). Volume 27., ACM (2008)  85
  • [5] Debevec, P.: The Light Stages and Their Applications to Photoreal Digital Actors. In: SIGGRAPH Asia, Singapore (November 2012)
  • [6] Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., Fitzgibbon, A.: Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In: UIST, New York, NY, USA, ACM (2011) 559–568
  • [7] Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H.: Scanning 3d full human bodies using kinects. IEEE TVCG 18(4) (April 2012) 643–650
  • [8] Li, H., Vouga, E., Gudym, A., Luo, L., Barron, J.T., Gusev, G.: 3d self-portraits. In: ACM SIGGRAPH Asia. Volume 32., ACM (November 2013) 187:1–187:9
  • [9] Newcombe, R.A., Fox, D., Seitz, S.M.: Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In: IEEE CVPR. (June 2015)
  • [10] Dou, M., Taylor, J., Fuchs, H., Fitzgibbon, A., Izadi, S.: 3d scanning deformable objects with a single rgbd sensor. In: IEEE CVPR. (June 2015) 493–501
  • [11] Li, H., Yu, J., Ye, Y., Bregler, C.: Realtime facial animation with on-the-fly correctives. In: ACM SIGGRAPH. Volume 32., ACM (July 2013) 42:1–42:10
  • [12] Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: IEEE CVPR, IEEE (2014) 1106–1113
  • [13] Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: IEEE CVPR, IEEE (2012) 1862–1869
  • [14] Bogo, F., Black, M.J., Loper, M., Romero, J.: Detailed full-body reconstructions of moving people from monocular RGB-D sequences. (December 2015) 2300–2308
  • [15] Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. In: ACM SIGGRAPH. SIGGRAPH ’08, New York, NY, USA, ACM (2008) 97:1–97:9
  • [16] de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. In: ACM SIGGRAPH, New York, NY, USA, ACM (2008) 98:1–98:10
  • [17] Li, H., Adams, B., Guibas, L.J., Pauly, M.: Robust single-view geometry and motion reconstruction. In: ACM SIGGRAPH Asia. SIGGRAPH Asia ’09, New York, NY, USA, ACM (2009) 175:1–175:10
  • [18] Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., Stamminger, M.: Real-time non-rigid reconstruction using an rgb-d camera. In: ACM SIGGRAPH. Volume 33., New York, NY, USA, ACM (July 2014) 156:1–156:12
  • [19] Wu, C., Stoll, C., Valgaerts, L., Theobalt, C.: On-set performance capture of multiple actors with a stereo camera. ACM Trans. Graph. 32(6) (November 2013) 161:1–161:11
  • [20] Liu, Y., Ye, G., Wang, Y., Dai, Q., Theobalt, C.: Human Performance Capture Using Multiple Handheld Kinects. In: Computer Vision and Machine Learning with RGB-D Sensors. Springer International Publishing, Cham (2014) 91–108
  • [21] Wand, M., Jenke, P., Huang, Q., Bokeloh, M., Guibas, L., Schilling, A.: Reconstruction of deforming geometry from time-varying point clouds. In: SGP. SGP ’07 (2007) 49–58
  • [22] Wand, M., Adams, B., Ovsjanikov, M., Berner, A., Bokeloh, M., Jenke, P., Guibas, L., Seidel, H.P., Schilling, A.: Efficient reconstruction of nonrigid shape and motion from real-time 3d scanner data. ACM TOG 28(2) (May 2009) 15:1–15:15
  • [23] Süßmuth, J., Winter, M., Greiner, G.: Reconstructing animated meshes from time-varying point clouds. In: SGP. SGP ’08 (2008) 1469–1476
  • [24] Tevs, A., Berner, A., Wand, M., Ihrke, I., Bokeloh, M., Kerber, J., Seidel, H.P.: Animation cartography—intrinsic reconstruction of shape and motion. ACM TOG 31(2) (April 2012) 12:1–12:15
  • [25] Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. In: ACM SIGGRAPH Asia. SIGGRAPH Asia ’09 (2009) 174:1–174:11
  • [26] Li, H., Luo, L., Vlasic, D., Peers, P., Popović, J., Pauly, M., Rusinkiewicz, S.: Temporally coherent completion of dynamic shapes. ACM TOG 31(1) (February 2012) 2:1–2:11
  • [27] Bojsen-Hansen, M., Li, H., Wojtan, C.: Tracking surfaces with evolving topology. ACM Transactions on Graphics (SIGGRAPH 2012) 31(4) (2012) 53:1–53:10
  • [28] Zhang, Z.: Iterative point matching for registration of free-form curves and surfaces. IJCV 13(2) (1994) 119–152
  • [29] Chen, Y., Medioni, G.: Object modeling by registration of multiple range images. In: ICRA, IEEE (1991) 2724–2729
  • [30] Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on 21(5) (1999) 433–449
  • [31] Gelfand, N., Mitra, N.J., Guibas, L.J., Pottmann, H.: Robust global registration. In: Symposium on geometry processing. Volume 2. (2005)  5
  • [32] Rusu, R.B., Blodow, N., Marton, Z.C., Beetz, M.: Aligning point cloud views using persistent feature histograms. In: Intelligent Robots and Systems, 2008 IEEE/RSJ International Conference on, IEEE (2008) 3384–3391
  • [33] Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: Robotics and Automation, 2009 IEEE International Conference on, IEEE (2009) 3212–3217
  • [34] Makadia, A., Patterson, A., Daniilidis, K.: Fully automatic registration of 3d point clouds. In: CVPR, 2006 IEEE Conference on. Volume 1., IEEE (2006) 1297–1304
  • [35] Horn, B.K.: Extended gaussian images. Proceedings of the IEEE 72(12) (1984) 1671–1686
  • [36] Chung, D.H., Yun, I.D., Lee, S.U.: Registration of multiple-range views using the reverse-calibration technique. Pattern Recognition 31(4) (1998) 457–464
  • [37] Chen, C.S., Hung, Y.P., Cheng, J.B.: Ransac-based darces: A new approach to fast automatic registration of partially overlapping range images. Pattern Analysis and Machine Intelligence, IEEE Transactions on 21(11) (1999) 1229–1234
  • [38] Silva, L., Bellon, O.R., Boyer, K.L.: Precision range image registration using a robust surface interpenetration measure and enhanced genetic algorithms. Pattern Analysis and Machine Intelligence, IEEE Transactions on 27(5) (2005) 762–776
  • [39] Yang, J., Li, H., Jia, Y.: Go-icp: Solving 3d registration efficiently and globally optimally. In: Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE (2013) 1457–1464
  • [40] Mellado, N., Aiger, D., Mitra, N.J.: Super 4pcs fast global pointcloud registration via smart indexing. In: Computer Graphics Forum. Volume 33., Wiley Online Library (2014) 205–215
  • [41] Moezzi, S., Tai, L.C., Gerard, P.: Virtual view generation for 3d digital video. MultiMedia, IEEE 4(1) (Jan 1997) 18–26
  • [42] Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’00, New York, NY, USA, ACM Press/Addison-Wesley Publishing Co. (2000) 369–374
  • [43] Franco, J., Lapierre, M., Boyer, E.: Visual shapes of silhouette sets. In: 3D Data Processing, Visualization, and Transmission, Third International Symposium on. (June 2006) 397–404
  • [44] Ahmed, N., Theobalt, C., Dobrev, P., Seidel, H.P., Thrun, S.: Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In: IEEE CVPR. (June 2008) 1–8
  • [45] Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Comput. Graph. Appl. 27(3) (May 2007) 21–31
  • [46] Wu, C., Varanasi, K., Liu, Y., Seidel, H.P., Theobalt, C.: Shading-based dynamic shape refinement from multi-view video under general illumination, IEEE (November 2011) 1108–1115
  • [47] Hernández, C., Schmitt, F., Cipolla, R.: Silhouette coherence for camera calibration under circular motion. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(2) (2007) 343–349
  • [48] Wei, L., Huang, Q., Ceylan, D., Vouga, E., Li, H.: Dense human body correspondences using convolutional networks. In: IEEE CVPR, IEEE (2016)
  • [49] Kennedy, J.: Particle swarm optimization. In: Encyclopedia of Machine Learning. Springer (2010) 760–766
  • [50] Shoemake, K.: Uniform random rotations. In: Graphics Gems III, Academic Press Professional, Inc. (1992) 124–132
  • [51] Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The princeton shape benchmark. In: Shape modeling applications, 2004. Proceedings, IEEE (2004) 167–178
  • [52] Rusinkiewicz, S., Levoy, M.: Efficient variants of the icp algorithm. In: 3-D Digital Imaging and Modeling, IEEE (2001) 145–152
  • [53] Rusu, R.B., Cousins, S.: 3d is here: Point cloud library (pcl). In: Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE (2011) 1–4
  • [54] Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the fourth Eurographics symposium on Geometry processing. Volume 7. (2006)
  • [55] Chuang, M., Luo, L., Brown, B.J., Rusinkiewicz, S., Kazhdan, M.: Estimating the laplace-beltrami operator by restricting 3d functions. In: Computer Graphics Forum. Volume 28., Wiley Online Library (2009) 1475–1484
  • [56] Chuang, M., Luo, L., Brown, B.J., Rusinkiewicz, S., Kazhdan, M.: Estimating the Laplace-Beltrami operator by restricting 3D functions. Symposium on Geometry Processing (July 2009)
  • [57] Zhou, Q.Y., Koltun, V.: Color map optimization for 3d reconstruction with consumer depth cameras. In: ACM SIGGRAPH. Volume 33., ACM (July 2014) 155:1–155:10

Supplemental to Section 4.2 – Particle Update Methods

This section discusses in detail the update methods for guide particles and regular particles during the particle swarm optimization.

Guide Particle Update.

Here we describe how we update guide particle ( is the particle index and is the current frame number). We parameterize the tangent space of at by with , where is the cross-product matrix . For any fixed , and partial scans , we can separate and into regions , as described in the main text. We then have

Minimizing with respect to is then a non-linear least squares problem, which we solve using the Levenberg-Marquardt algorithm. We begin with initial guess and iteratively apply the quasi-Newton update


where is obtained from solving the linear system


is a stacked column vector such that , and is the Jacobian matrix of calculated using the chain rule. The damping factor is set to 0.1 throughout all experiments. After m converges, we set .
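For reference, a Levenberg-Marquardt (damped Gauss-Newton) iteration commonly takes the form below. The symbols are generic assumptions rather than the notation of this paper: $r(m)$ denotes the stacked residual vector, $J = \partial r / \partial m$ its Jacobian, $\delta$ the update step, and $\lambda$ the damping factor.

```latex
\begin{align}
  \left( J^\top J + \lambda I \right) \delta &= -J^\top r(m), \\
  m &\leftarrow m + \delta, \qquad \lambda = 0.1 .
\end{align}
```

Some implementations scale the damping term by $\operatorname{diag}(J^\top J)$ instead of the identity; the text above does not say which variant is used, so the identity form here is only one possibility.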

Regular Particle Update.

Here we describe how we update regular (non-guide) particle ( is the particle index and is the current frame number). We parameterize as , where is the imaginary part of the quaternion representation of . Up to the sign of the real part, which is assumed positive, determines a unique unit quaternion representing the rotation . is updated in a traditional PSO fashion,


where is the velocity of the previous iteration, is the best location particle has been at, and is the best particle location within radius . Please refer to [49] for more details. The fixed weights , and are set to 0.2, 0.3 and 0.3 throughout all experiments. After the update, the boundary condition () is checked and enforced by normalization if violated.
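A textbook PSO step with these weights might look like the sketch below. Several points are assumptions made for illustration: each particle is stored as a 6-vector (imaginary quaternion part followed by translation), the three weights map in order to the inertia, personal-best, and local-best terms, the uniform random factors of the classical formulation are included, and the boundary condition is read as requiring the imaginary part to have norm at most 1. The function names are hypothetical.

```python
import math
import random

# Assumed mapping of the three fixed weights: inertia, personal best, local best.
W_INERTIA, W_PERSONAL, W_LOCAL = 0.2, 0.3, 0.3


def pso_step(x, v, personal_best, local_best, rng):
    """One PSO update of a particle x = (qx, qy, qz, tx, ty, tz).

    Returns the new position and the new velocity.
    """
    r1, r2 = rng.random(), rng.random()  # textbook random factors (assumption)
    new_v = [
        W_INERTIA * vi + W_PERSONAL * r1 * (pb - xi) + W_LOCAL * r2 * (lb - xi)
        for xi, vi, pb, lb in zip(x, v, personal_best, local_best)
    ]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    # Boundary condition: an imaginary part with norm above 1 cannot belong
    # to a unit quaternion, so it is rescaled back onto the unit sphere.
    norm = math.sqrt(sum(c * c for c in new_x[:3]))
    if norm > 1.0:
        new_x[:3] = [c / norm for c in new_x[:3]]
    return new_x, new_v


def to_unit_quaternion(x):
    """Recover (w, qx, qy, qz), choosing the non-negative real part w."""
    w = math.sqrt(max(0.0, 1.0 - sum(c * c for c in x[:3])))
    return (w, x[0], x[1], x[2])
```

Storing only the imaginary part keeps the rotation search space three-dimensional, at the cost of the explicit norm check after every update.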

Supplemental to Section 3 – Surface Reconstruction Algorithm

This section summarizes the surface reconstruction method. After globally registering partial scans of each frame, we perform Poisson surface reconstruction [54] to fuse three or four partial scans ( and denote the frame and the sensor number respectively), and we obtain a sequence of complete, watertight surfaces . To reduce flickering artifacts and to fill holes, we adopt the shape completion pipeline of Li et al. [26] to warp partial scans from temporally proximate frames to the current frame geometry. For , we warp and to align with using a mesh deformation model based on pairwise correspondences and Laplacian coordinates. We further combine them all using Poisson surface reconstruction with the following weights: 10 for the reconstructed mesh of the current frame and the warped neighboring frames, 2 for the hole-filled regions of the current frame, and 1 for the hole-filled regions of the warped neighboring frames. This imposes a mild temporal filter on the reconstructed surfaces, and a strong filter on the hole-filled regions. This step reduces the temporal flicker, and propagates some of the reconstructed surface detail from the neighboring frames onto the current frame (this stems from the neighboring reconstructed mesh weight being larger than any hole-filled region weight). Please refer to [26] for more details.

Supplemental to Section 3 – Texture Reconstruction Algorithm

This section explains in detail the texture reconstruction step based on dense correspondences. After the surface reconstruction step, we first perform texture reconstruction [56] to obtain texture for , by fusing and interpolating the texture from partial scans ( and denote the frame and the sensor number respectively). However, each surface contains regions where this texture is unreliable, either because the region had poor coverage in the partial scans, or because it is located near the seam between two partial scans, where the texture is inconsistent due to sensor noise and variations in lighting. When capturing clothed humans using three sensors, we observe that roughly 10–20% of the texture on each surface is unreliable.

The recent work of Zhou and Koltun [57] presents impressive results on texturing scanned data. This method, however, assumes that the captured scene is static and thus is not applicable in our setting. Tracking methods like optical flow can be used to transfer texture between consecutive surfaces in our capture sequence, but we found them to be too fragile for our purposes: they fail if the deformation between frames is either too large (so that tracking fails) or too small (so that holes in coverage persist over large numbers of frames). Instead, we replace unreliable texture on each surface by computing dense correspondences between and other surfaces in the sequence (including temporally distant frames), and transferring texture from surfaces whose texture at the corresponding point is reliable. With this approach we can reconstruct reliable texture even in the presence of large geometry or topology changes over time.

Reliability Weight.

We first need a measure of how reliable the reconstructed texture is at each point of each surface . Intuitively, texture is most reliable at points that directly face the camera; therefore for partial scans where p is visible, we set

where is the surface normal at and is the view direction of the sensor that captured . If is visible in multiple partial scans, we take the maximum weight, and if it is visible in none, we set . Furthermore we feather the weights of points that lie close to the boundaries of any partial scans, as texture at the seams tends to also be unreliable.
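One plausible way to compute such a weight is sketched below. It assumes the weight is the cosine of the angle between the surface normal and the direction from the point toward the sensor, clamped at zero, and that a point seen by no sensor receives weight 0. Both choices are illustrative assumptions, feathering near scan boundaries is omitted, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def reliability(point, normal, sensor_positions):
    """Maximum clamped cosine between the normal and the direction to a sensor.

    sensor_positions lists only the sensors whose partial scan contains the
    point; an empty list yields weight 0.
    """
    n = normalize(normal)
    best = 0.0  # clamps back-facing views and covers the "not visible" case
    for s in sensor_positions:
        to_sensor = normalize([sc - pc for sc, pc in zip(s, point)])
        cosine = sum(a * b for a, b in zip(n, to_sensor))
        best = max(best, cosine)
    return best
```

A point viewed head-on gets weight 1, a point viewed at a grazing angle gets a weight near 0, and taking the maximum means one good view is enough to mark the texture as reliable.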

Computing Correspondences.

We adopt the method of Wei et al. [48] to predict a pose-invariant descriptor for every vertex of each . The network of Wei et al. is trained on a large dataset of captured and artificial human depth images, and can reliably compute a 16-dimensional unit-length descriptor for every vertex, where nearby points in feature space nearly correspond on the surfaces.

Texture Transfer.

We declare all points with unreliable and all others reliable. We set throughout all experiments. We compute descriptors for all reliable points (across all frames) and place them in a KD-tree; for each unreliable point , we compute its 50 nearest neighbors (in feature space) among reliable points, and take as the color of the weighted average of those neighbors, with each neighbor weighted by its distance from in feature space and by .
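The transfer step can be sketched as a k-nearest-neighbor lookup followed by a weighted color average. This is a minimal illustration under stated assumptions: a brute-force search stands in for the KD-tree, each neighbor's weight is taken to be its reliability divided by its descriptor distance (the inverse-distance form is a guess), and the 2-D descriptors in the test data below replace the 16-dimensional ones for brevity. The name `transfer_color` and the `eps` guard are hypothetical.

```python
import heapq
import math


def transfer_color(query_desc, reliable, k=50, eps=1e-6):
    """Blend the colors of the k reliable points nearest in descriptor space.

    reliable: list of (descriptor, reliability_weight, (r, g, b)) tuples
    gathered across all frames. Neighbor weight = reliability / (distance
    + eps); eps avoids division by zero for identical descriptors.
    """
    scored = [(math.dist(query_desc, d), w, c) for d, w, c in reliable]
    # Brute-force k-NN; a KD-tree would replace this for large point sets.
    nearest = heapq.nsmallest(k, scored, key=lambda item: item[0])
    total = 0.0
    color = [0.0, 0.0, 0.0]
    for dist, w, c in nearest:
        weight = w / (dist + eps)
        total += weight
        for ch in range(3):
            color[ch] += weight * c[ch]
    return tuple(ch / total for ch in color)


# Two reliable points: one red, one blue.
reliable_points = [
    ((1.0, 0.0), 1.0, (255, 0, 0)),
    ((0.0, 1.0), 1.0, (0, 0, 255)),
]
near_red = transfer_color((1.0, 0.0), reliable_points)
halfway = transfer_color((0.5, 0.5), reliable_points)
```

Because the search spans all frames, a region that is never well observed in one frame can still borrow color from a temporally distant frame in which the corresponding surface point was seen clearly.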

Figure 11: Example registration results of range images with limited overlap. The first two rows show range images from the Stanford 3D Scanning Repository, while the last four rows show data from the Princeton Shape Benchmark.

Supplemental to Section 5 – Qualitative Registration Results

Figure 11 extends Figure 7 in the main text with additional global registration results.

Supplemental to Section 6 – List of Captured Sequences

This section lists statistics for all captured sequences in Table 2.

Sequence    Sensor         # Sensors   # Frames   Av. Vertex
Walking 1   Kinect One     3           250        250,000
Jumping     Kinect One     3           209        270,000
Kicking     Kinect One     3           198        260,000
Tai Chi     Kinect One     4           491        128,000
Swimming    Kinect One     4           370        115,000
Walking 2   Kinect One     4           201        160,000
Dog 1       Kinect One     4           441        150,000
Dog 2       Structure IO   4           300        145,000
Table 2: List of all captured sequences.

Supplemental to Section 6 – Limitations of Capture System

This section covers limitations of the proposed capture system. The global registration fails when there is barely any overlap, i.e., below , potentially caused by two neighboring sensors drifting apart. Our method fails to capture fast motion, e.g., jumping, due to minor asynchronization across different sensors (Figure 12). Because of the sparse views, some regions can remain consistently occluded, and their texture cannot be accurately recovered from other frames (Figure 12). Finally, in large occluded regions, Poisson reconstruction might fill in missing surface data with geometry far from the ground-truth human shape. In the future we wish to repair these regions by propagating details, using an approach similar to the one we use to fix the texture.

Figure 12: Left: Registration failure of frames with fast motion due to minor asynchronization across different sensors. Right: Failed texture reconstruction on consistently occluded regions.