1 Introduction
The rekindling of interest in immersive, 360-degree virtual environments, spurred on by the Oculus, HoloLens, and other breakthroughs in consumer AR and VR hardware, has created a need for digitizing objects with full geometry and texture from all views. Among the most important subjects to digitize in this way are moving, clothed humans, yet they are also among the most challenging: the human body can undergo large deformations over short time spans, has complex geometry with occluded regions that can only be seen from a small number of angles, and has regions like the face with important high-frequency features that must be faithfully preserved.
Most techniques for capturing high-quality digital humans rely on a large array of sensors mounted around a fixed capture volume. The recent work of Collet et al. [1] uses such a setup to capture live performances and compresses them to enable streaming of free-viewpoint videos. Unfortunately, these techniques are severely restrictive: first, to ensure high-quality reconstruction and sufficient coverage, a large number of expensive sensors must be used, leaving human capture out of reach of consumers without the resources of a professional studio. Second, the subject must remain within the small working volume enclosed by the sensors, ruling out subjects interacting with large, open environments or undergoing large motions.
Using free-viewpoint sensors is an attractive alternative, since it does not constrain the capture volume and allows ordinary consumers, with access to only portable, low-cost devices, to capture human motion. The typical challenge with using handheld active sensors is that multiple sensors must be used simultaneously from different angles to achieve adequate coverage of the subject. In overlapping regions, signal interference causes significant deterioration in the quality of the captured geometry. This problem can be avoided by minimizing the amount of overlap between sensors; on the other hand, existing registration algorithms for aligning the captured partial scans only work reliably if the partial scans overlap significantly. Template-based methods like the work of Ye et al. [2] circumvent these difficulties by warping a full geometric template to track the moving sparse partial scans, but templates are only readily available for naked humans [3]; for clothed humans, a template must be precomputed on a case-by-case basis.
We thus introduce a new shape registration method that can reliably register partial scans even with almost no overlap, sidestepping the need for shape templates or sensor arrays. The method is based on a visibility error metric, which encodes the intuition that if a set of partial scans is properly registered, each partial scan, when viewed from the same angle at which it was captured, should occlude all other partial scans. We solve the global registration problem by minimizing this error metric using a particle-swarm strategy, which ensures sufficient coverage of the solution space to avoid local minima. This registration method significantly outperforms state-of-the-art global registration techniques like 4PCS [4] in challenging cases with small overlap.
Contributions.
We present the first end-to-end free-viewpoint reconstruction framework that produces watertight, fully-textured surfaces of moving, clothed humans using only three to four handheld depth sensors, without the need for shape templates or extensive calibration. The most significant technical component of this system is a robust pairwise global registration algorithm, based on minimizing a visibility error metric, that can align depth maps even in the presence of very little (15%) overlap.
2 Related Work
Digitizing realistic, moving characters has traditionally involved an intricate pipeline including modeling, rigging, and animation. This process has been occasionally assisted by 3D motion and geometry capture systems, such as marker-based motion capture or markerless capture methods involving large arrays of sensors [5]. Both approaches supply artists with accurate reference geometry and motion, but they require specialized hardware and a controlled studio setting.
Real-time 3D scanning and reconstruction systems requiring only a single sensor, like KinectFusion [6], allow casual users to easily scan everyday objects; however, as with most simultaneous localization and mapping (SLAM) techniques, the major assumption is that the scanned scene is rigid. This assumption is invalid for humans, even for humans attempting to maintain a single pose; several follow-up works have addressed this limitation by allowing near-rigid motion and using non-rigid partial scan alignment algorithms [7, 8]. While the recent DynamicFusion framework [9] and similar systems [10] show impressive results in capturing non-rigidly deforming scenes, our goal of capturing and tracking freely moving targets is fundamentally different: we seek to reconstruct a complete model of the moving target at all times, which requires either extensive prior knowledge of the subject's geometry or the use of multiple sensors to provide better coverage.
Prior work has proposed various simplifying assumptions to make the problem of capturing entire shapes in motion tractable. Examples include assuming the availability of a template, high-quality data, smooth motion, or a controlled capture environment.
Template-based Tracking:
The vast majority of related work on capturing dynamic motion focuses on specific human parts, such as faces [11] and hands [12, 13], for which specialized shapes and motion templates are available. In the case of tracking the full human body, parameterized body models [14] have been used. However, such models work best on naked subjects or subjects wearing very tight clothing, and are difficult to adapt to moving people wearing more typical garments.
Another category of methods first captures a template in a static pose and then tracks it across time. Vlasic et al. [15] use a rigged template model, and de Aguiar et al. [16] apply a skeleton-less shape deformation model to the template to track human performances from multi-view video data. Other methods [17, 18] use a smoothed template to track motion in a captured sequence. The more recent works of Wu et al. [19] and Liu et al. [20] track both the surface and the skeleton of a template from stereo cameras and a sparse set of depth sensors, respectively.
All of these template-based approaches handle the problem of tracking moving targets with ease, since the entire geometry of the target is known. However, in addition to requiring the construction or fitting of said template, these methods share the common limitation that they cannot handle the geometry or topology changes that are likely to occur during typical human motion (picking up an object, crossing arms, etc.).
Dynamic Shape Capture:
Several works have proposed to reconstruct both shape and motion from a dynamic motion sequence. Given a series of time-varying point clouds, Wand et al. [21] use a uniform deformation model to capture both geometry and motion. A follow-up work [22] proposes to separate the deformation models used for geometry and motion capture. Both methods make the strong assumption that the motion is smooth, and thus suffer from popping artifacts in the case of large motions between time steps. Süßmuth et al. [23] fit a 4D space-time surface to the given sequence, but they assume that the complete shape is visible in the first frame. Finally, Tevs et al. [24] detect landmark correspondences which are then extended to dense correspondences. While this method can handle a considerable amount of topological change, it is sensitive to large acquisition holes, which are typical of commercial depth sensors.
Another category of related work aims to reconstruct a deforming watertight mesh from a dynamic capture sequence by imposing either visual hull [25] or temporal coherency constraints [26]. Such constraints either limit the capture volume or are not sufficient to handle large holes. Furthermore, neither of these methods focuses on propagating texture to invisible areas; in contrast, we use dense correspondences to perform texture inpainting in non-visible regions. Bojsen-Hansen et al. [27] also use dense correspondences to track surfaces with evolving topologies; however, their method requires the input to be a closed manifold surface. Our goal, on the other hand, is to reconstruct such complete meshes from sparse partial scans.
The recent work of Collet et al. [1] uses multimodal input data from a stage setup to capture topologically-varying scenes. While this method produces impressive results, it requires a pre-calibrated, complex setup. In contrast, we use a significantly cheaper and more convenient setup composed of three to four commercial depth sensors.
Global Range Image Registration:
At the heart of our approach is a robust algorithm that registers noisy data coming from commercial depth sensors with very little overlap. A typical approach is to first perform global registration to compute an approximate rigid transformation between a pair of range images, which is then used to initialize local registration methods (e.g., Iterative Closest Point (ICP) [28, 29]) for further refinement. A popular approach for global registration is to construct feature descriptors for a set of interest points, which are then correlated to estimate a rigid transformation. Spin images [30], integral volume descriptors [31], and point feature histograms (PFH, FPFH) [32, 33] are among the popular descriptors proposed by prior work. Makadia et al. [34] represent each range image as a translation-invariant extended Gaussian image (EGI) [35] using surface normals. They first compute the optimal rotation by correlating two EGIs, and then estimate the corresponding translation using the Fourier transform. For noisy data such as that produced by a commercial depth sensor, however, it is challenging to compute reliable feature descriptors. Another approach for global registration is to align either the main axes extracted by principal component analysis (PCA) [36] or a sparse set of control points in a RANSAC loop [37]. Silva et al. [38] introduce a robust surface interpenetration measure (SIM) and search the 6-DoF parameter space with a genetic algorithm. More recently, Yang et al. [39] adopt a branch-and-bound strategy to extend the basic ICP algorithm globally. 4PCS [4] and its latest variant Super4PCS [40] register a pair of range images by extracting all coplanar 4-point sets. Such approaches, however, are likely to converge to incorrect alignments in cases of very little overlap between the range images (see Section 5). Several prior works have adopted silhouette-based constraints for aligning multiple images [41, 42, 43, 44, 45, 46, 47]. While the idea is similar to our approach, our registration algorithm also takes advantage of depth information, and employs a particle-swarm optimization strategy that efficiently explores the space of alignments.
3 System Overview
Our pipeline for reconstructing fully-textured, watertight meshes from three to four depth sensors can be decomposed into four major steps. See Figure 1 for an overview of how these steps interrelate.
1. Data Capture: We capture the subject (who is free to move arbitrarily) using uncalibrated, handheld, real-time RGB-D sensors. We experimented with both Kinect One time-of-flight cameras mounted on laptops, and Occipital Structure IO sensors mounted on iPad Air 2 tablets (Section 6).
2. Global Rigid Registration: The relative positions of the depth sensors constantly change over time, and the captured depth maps often have little overlap (10%–30%). For each frame, we globally register sparse depth images from all views (Section 4). This step produces registered, but incomplete, textured partial scans of the subject for each frame.
3. Surface Reconstruction: To reduce flickering artifacts, we adopt the shape completion pipeline of Li et al. [26] to warp partial scans from temporally-proximate frames to the current frame geometry. A weighted Poisson reconstruction step then extracts a single watertight surface. There is no guarantee, however, that the resulting fused surface has complete texture coverage (and indeed texture will typically be missing at partial scan seams and in occluded regions).
4. Dense Correspondences for Texture Reconstruction: We complete regions of missing or unreliable texture in one frame by propagating data from other (perhaps very temporally-distant) frames with reliable texture in that region. We adopt a recently-proposed correspondence computation framework [48] based on a deep neural network to build dense correspondences between any two frames, even if the subject has undergone large relative deformations. Having built dense correspondences, we transfer texture from reliable regions to less reliable ones.
We next describe the details of the global registration method as it constitutes the core of our pipeline. Please refer to the supplementary material for more details of the other components.
4 Robust Rigid Registration
The key technical challenge in our pipeline is registering a set of depth images accurately without assuming any initialization, even when the geometry visible in each depth image has very little overlap with any other depth image. We attack this problem by developing a robust pairwise global registration method: let $S_1$ and $S_2$ be partial meshes generated from two depth images captured simultaneously. We seek a global Euclidean transformation $T \in SE(3)$ which aligns $S_2$ to $S_1$. Traditional pairwise registration, based on finding corresponding points on $S_1$ and $S_2$ and minimizing the distance between them, has notorious difficulty in this setting. We therefore propose a novel visibility error metric (VEM) (Section 4.1), and we minimize the VEM to find $T$ (Section 4.2). We further extend this pairwise method to handle multi-view global registration (Section 4.3).
4.1 Visibility Error Metric
Suppose $S_1$ and $S_2$ are correctly aligned, and consider looking at the pair of scans through a camera whose position and orientation matches that of the sensor used to capture $S_1$. The only parts of $S_2$ that should be visible from this view are those that overlap with $S_1$: parts of $S_2$ that do not overlap should be completely occluded by $S_1$ (otherwise they would have been detected and included in $S_1$). Similarly, when looking at the scene through the camera that captured $S_2$, only parts of $S_1$ that overlap with $S_2$ should be visible.
Visibility-Based Alignment Error. We now formalize the above idea. Let $S_1, S_2$ be two partial scans, with $S_1$ captured using a sensor at position $o$ and view direction $d$. For every point $x \in S_2$, let $\hat{x}$ be the first intersection point of $S_1$ and the ray $\overrightarrow{ox}$. We can partition $S_2$ into three regions, and associate to each region an energy density measuring the extent to which points in that region violate the above visibility criteria:

- points $x$ that are occluded by $S_1$ (i.e., $\hat{x}$ lies in front of $x$): To points in this region we associate no energy,
$$e(x) = 0.$$

- points $x$ that are in front of $S_1$: Such points might exist even when $S_1$ and $S_2$ are well-aligned, due to surface noise and roughness, etc. However, we penalize large violations using
$$e(x) = \|x - \hat{x}\|^2.$$

- points $x$ for which $\hat{x}$ does not exist: Such points also violate the visibility criteria. It is tempting to penalize such points proportionally to the distance between $x$ and its closest point $\bar{x}$ on $S_1$, but a small misalignment could create a point in $S_2$ that is very distant from $S_1$ in Euclidean space, despite being very close to $S_1$ on the camera image plane. We therefore penalize $x$ using squared distance on the image plane,
$$e(x) = \|P(x - \bar{x})\|^2,$$
where $P$ is the projection onto the plane orthogonal to $d$.

Figure 2 illustrates these regions on a didactic 2D example. Alignment of $S_1$ and $S_2$, from the point of view of $S_1$'s camera, is then measured by the aggregate energy $E_{S_1}(S_2) = \sum_{x \in S_2} e(x)$. Finally, every Euclidean transformation $T$ that produces a possible alignment between $S_1$ and $S_2$ can be associated with an energy to define our visibility error metric on $SE(3)$,
$$E_{\mathrm{VEM}}(T) = E_{S_1}(T(S_2)) + E_{T(S_2)}(S_1), \qquad (1)$$
where the camera of $S_2$ is transformed together with its scan.
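The per-point energy for the two ray-hit cases described above can be sketched along a single camera ray (a didactic sketch under assumed names and a hypothetical noise tolerance `eps`, not the paper's implementation; the no-intersection case needs the image-plane projection and is handled separately):

```python
def visibility_energy(depth_x, depth_hit, eps=0.01):
    """Per-point VEM contribution along one camera ray.

    depth_x   : depth of a point x of scan S2 along a ray from S1's camera.
    depth_hit : depth of the first intersection of that ray with scan S1,
                or None if the ray misses S1 (that case is penalized
                separately via squared image-plane distance).
    eps       : hypothetical tolerance for surface noise (an assumption).
    """
    if depth_hit is None:
        return None  # region (c): handled via the image-plane projection
    if depth_x >= depth_hit - eps:
        return 0.0   # region (a): x is occluded by S1, visibility is consistent
    violation = depth_hit - depth_x - eps  # region (b): x floats in front of S1
    return violation ** 2
```

Summing such contributions over all points of one scan, from the other scan's viewpoint (and vice versa), yields the aggregate alignment energy.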
4.2 Finding the Transformation
Figure 3: (a) Left: a pair of range images to be registered. Right: the VEM evaluated over the entire rotation space; each point within the unit ball represents the vector part of a unit quaternion, and for each quaternion we estimate its corresponding translation component and evaluate the VEM on the composite transformation. The red rectangles indicate areas with local minima, and the red cross is the global minimum. (b) Example particle locations and displacements at two consecutive iterations. Blue vectors indicate displacements of regular (non-guide) particles following a traditional particle swarm scheme; red vectors are displacements of guide particles. Guide particles draw neighboring regular particles more efficiently towards local minima to search for the global minimum.

Minimizing the error metric (1) consists of solving a nonlinear least-squares problem, and so in principle it can be optimized using, e.g., the Gauss-Newton method. However, it is non-convex and prone to local minima (Figure 3). Absent a straightforward heuristic for picking a good initial guess, we instead adopt a Particle Swarm Optimization (PSO) [49] method to efficiently minimize (1), where "particles" are candidate rigid transformations that move towards smaller energy landscapes in $SE(3)$. We could independently minimize the energy starting from each particle as an initial guess, but this strategy is not computationally tractable. Instead, we iteratively update all particle positions in lockstep: a small set of the most promising guide particles, those most likely to be close to the global minimum, are updated using an iteration of Levenberg-Marquardt, while the rest of the particles receive PSO-style weighted random perturbations. This procedure is summarized in Algorithm 1, and each step is described in more detail below.

Initial Particle Sampling. We begin by sampling particles, where each particle represents a rigid motion $T_i \in SE(3)$. Since $SE(3)$ is not compact, it is not straightforward to directly sample the initial particles. We instead uniformly sample only the rotational component $R_i$ of each particle [50], and solve for the best translation using the following Hough-transform-like procedure. For every $p \in S_1$ and $q \in S_2$, we measure the angle between their respective normals (with the sampled rotation applied), and if it is less than a threshold, the pair votes for a translation of $q - R_i p$. These translations are binned, and the best translation is extracted from the bin with the most votes. The translation estimation procedure is robust even in the presence of limited overlap (Figure 4).
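The initial particle sampling could be sketched as follows (an illustrative NumPy sketch; the quaternion-based rotation sampling, the bin count, and the normal-angle threshold are assumptions, not necessarily the paper's settings, and the rotation is taken as already applied to the source scan):

```python
import numpy as np

def random_rotation(rng):
    """Uniform rotation on SO(3): a normalized 4D Gaussian sample is a
    uniformly distributed unit quaternion, converted here to a matrix."""
    q = rng.normal(size=4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def vote_translation(src_pts, src_nrm, dst_pts, dst_nrm,
                     n_bins=10, max_angle_deg=15.0):
    """Hough-style translation estimate for an already-rotated source scan.

    Every (p, q) pair whose normals agree to within max_angle_deg votes for
    the translation q - p; votes are binned on a coarse 3D grid and the
    centre of the fullest bin is returned.
    """
    cos_thresh = np.cos(np.radians(max_angle_deg))
    votes = [q - p
             for p, n_p in zip(src_pts, src_nrm)
             for q, n_q in zip(dst_pts, dst_nrm)
             if np.dot(n_p, n_q) > cos_thresh]
    hist, edges = np.histogramdd(np.array(votes), bins=n_bins)
    idx = np.unravel_index(np.argmax(hist), hist.shape)
    return np.array([(e[i] + e[i + 1]) / 2 for e, i in zip(edges, idx)])
```

Each sampled rotation, paired with its voted translation, yields one initial particle.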
The above procedure yields a set of initial particles. We next describe how to step the particles from their values at iteration $k$ to their values at iteration $k+1$.
Identifying Guide Particles. We want to select as guide particles those particles with the lowest visibility error metric; however, we do not want many clustered, redundant guide particles. Therefore we first promote the particle with the lowest error metric to guide particle, then remove from consideration all nearby particles, i.e. those whose rotations $R_i$ satisfy
$$d(R_i, R_g) < r,$$
where $d$ is the bi-invariant metric on $SO(3)$, i.e. the rotation angle of $R_i R_g^{-1}$, and $r$ is a fixed radius. We then repeat this process (promoting the remaining particle with the lowest VEM, removing nearby particles, etc.) until no candidates remain.
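This greedy selection with rotation-space suppression can be sketched as follows (illustrative names throughout; the suppression radius is an assumption, not the paper's value):

```python
import numpy as np

def select_guides(rotations, energies, radius_deg=30.0):
    """Pick guide particles: lowest-VEM first, suppressing nearby rotations.

    The distance between two rotations is the bi-invariant metric on SO(3),
    i.e. the angle of the relative rotation R_i R_g^T, recovered from its trace.
    """
    remaining = list(np.argsort(energies))
    guides = []
    while remaining:
        g = int(remaining.pop(0))     # promote the lowest-energy candidate
        guides.append(g)
        R_g = rotations[g]
        kept = []
        for i in remaining:           # suppress candidates within radius_deg
            c = (np.trace(rotations[i] @ R_g.T) - 1.0) / 2.0
            angle = np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
            if angle >= radius_deg:
                kept.append(i)
        remaining = kept
    return guides
```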
Guide Particle Update. We update each guide particle to decrease its VEM. We parameterize the tangent space of $SE(3)$ at the current transformation by two vectors $u, v \in \mathbb{R}^3$, with the rotation perturbed by $\exp([u]_\times)$, where $[u]_\times$ is the cross-product matrix of $u$. We then use the Levenberg-Marquardt method to find an energy-decreasing direction $(u, v)$, and update the particle accordingly. Please see the supplementary material for more details.
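For concreteness, the cross-product matrix and the resulting rotation update (via the Rodrigues formula) can be sketched as follows; this is a standard construction, and the Levenberg-Marquardt step that chooses the direction is omitted:

```python
import numpy as np

def skew(u):
    """Cross-product matrix [u]x, so that skew(u) @ v == np.cross(u, v)."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def exp_so3(u):
    """Rodrigues formula: map a tangent vector u to the rotation exp([u]x)."""
    theta = np.linalg.norm(u)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(u / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

A guide-particle step would then compose `exp_so3(u)` with the current rotation and add `v` to the translation.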
Other Particle Update. Performing a Levenberg-Marquardt iteration on all particles is too expensive, so we move the remaining non-guide particles by applying a randomly weighted summation of each particle's displacement during the previous iteration, the displacement towards its best past position, and the displacement towards the local best particle (the lowest-energy particle within a radius $r$, measured using $d$), as in standard PSO [49]. While the guide particles rapidly descend to local minima, they are also local best particles, and drag neighboring regular particles with them for a more efficient search of all local minima, from which the global one is extracted (Figure 3). Please refer to the supplementary material for more details.
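The displacement rule for a regular particle follows the textbook PSO form, sketched below with assumed inertia and attraction weights (not the paper's tuned values):

```python
import numpy as np

def pso_step(x, v_prev, personal_best, local_best,
             w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO update: blend the previous displacement with randomly weighted
    pulls towards the particle's own best position and the best nearby particle."""
    rng = rng or np.random.default_rng(0)
    r1, r2 = rng.random(2)
    v = w * v_prev + c1 * r1 * (personal_best - x) + c2 * r2 * (local_best - x)
    return x + v, v
```

Here `x` stands in for a particle's transformation parameters; on $SE(3)$ the differences would be taken in a local parameterization rather than componentwise.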
Termination. Since the VEM of each guide particle is guaranteed to decrease during every iteration, the particle with the lowest energy is always selected as a guide particle, and the local minima of the VEM lie in a bounded subset of $SE(3)$. In the above procedure, the particle with the lowest energy is therefore guaranteed to converge to a local minimum of the VEM. We terminate the optimization when the decrease in energy falls below a threshold; in practice this occurs within 5–10 iterations.
4.3 Multiview Extension
We extend our VEM-based pairwise registration method to globally align a total of $N$ partial scans $\{S_i\}$ by estimating the optimal transformation set $\{T_i\}$. First we perform pairwise registration between all pairs to build a registration graph, where each vertex represents a partial scan and each pair of vertices is linked by an edge carrying the estimated transformation. We then extract all spanning trees from the graph, and for each spanning tree we calculate its corresponding transformation set $\{T_i\}$ and estimate the overall VEM as
$$E(\{T_i\}) = \sum_{i \neq j} E_{\mathrm{VEM}}\!\left(T_j^{-1} T_i;\, S_i, S_j\right), \qquad (2)$$
where $E_{\mathrm{VEM}}(\cdot\,; S_i, S_j)$ is the pairwise metric of Equation (1) evaluated between scans $S_i$ and $S_j$.
We select the transformation set with the minimum overall VEM, and perform several iterations of the Levenberg-Marquardt algorithm on Equation (2) to further jointly refine the transformation set.
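Composing pairwise estimates along a chosen spanning tree can be sketched as follows (illustrative bookkeeping, not the paper's code; `pairwise[(i, j)]` is assumed to be a 4x4 matrix mapping scan j's frame into scan i's frame):

```python
import numpy as np

def chain_transforms(n, root, tree_edges, pairwise):
    """Propagate pairwise rigid transforms along a spanning tree.

    Returns, for each of the n scans, the 4x4 transform into the root frame.
    """
    T = {root: np.eye(4)}
    pending = list(tree_edges)
    while pending:
        rest = []
        for i, j in pending:
            if i in T:
                T[j] = T[i] @ pairwise[(i, j)]   # child frame -> root frame
            else:
                rest.append((i, j))
        if len(rest) == len(pending):
            raise ValueError("edges do not form a spanning tree from the root")
        pending = rest
    return [T[k] for k in range(n)]
```

Evaluating the overall VEM for each candidate tree then amounts to scoring the transformation sets produced this way.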
Temporal Coherence. When globally registering depth images from multiple sensors frame by frame, we can easily incorporate temporal coherence into the global registration framework by adding the final transformation set estimated for the previous frame to the pool of candidate transformation sets for the current frame, before selecting the best one. It is worth noting, however, that our capture system does not rely on the assumption of temporal coherence: the transformation set is estimated globally for each frame. This is especially crucial for a system with handheld sensors, where the temporal coherence assumption is easily violated.
5 Global Registration Evaluation
Data Sets.
We evaluate our registration algorithm on the Stanford 3D Scanning Repository and the Princeton Shape Benchmark [51]. We use 4 models from the Stanford 3D Scanning Repository (the Bunny, the Happy Buddha, the Dragon, and the Armadillo), and all 1814 models from the Princeton Shape Benchmark. We believe these two data sets, especially the latter, are general enough to cover the shape variation of real-world objects. For each data set, we generate 1000 pairs of synthetic depth images with uniformly varying degrees of overlap; these range maps are synthesized using randomly-selected 3D models and randomly-selected camera angles. Each pair is then initialized with a random relative transformation. Thus, for each pair of range images, we have the ground-truth transformation as well as the overlap ratio.
Evaluation Metric.
The extracted transformation, if not correctly estimated, can be at any distance from the ground-truth transformation, depending on the specific shape of the underlying surfaces and the distribution of local minima in the solution space. Thus, it is not very informative to directly use the RMSE of the rotation and translation estimates. It is more straightforward to use the success percentage as the evaluation metric: we declare a global registration successful if the angular error of the estimated rotation is smaller than a small threshold. We do not enforce the translation to be close, since it is scale-dependent, and the translation component is easily recovered by a robust local registration method if the rotation component is close enough (e.g., by using surface normals to prune incorrect correspondences [52]).

Effectiveness of the PSO Strategy.
To demonstrate the advantage of the particle-swarm optimization strategy, we compare our full algorithm to three alternatives on the Stanford 3D Scanning Repository: 1) a baseline method that simply reports the minimum-energy particle among all initially-sampled particles, with no attempt at optimization; 2) a traditional PSO formulation, without guide particles; and 3) updating only the guide particles, applying no displacement to ordinary particles.
Figure 5 compares the performance of the four alternatives. While updating guide particles alone achieves good registration results, incorporating the swarm intelligence further improves performance, especially on range scans with small overlap ratios.
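The success criterion used throughout these evaluations, the geodesic angle between estimated and ground-truth rotations, can be computed as follows (a sketch; the 10-degree threshold is an assumption, since the exact value is not stated here):

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Geodesic distance on SO(3): the angle of the relative rotation,
    recovered from the trace of R_est @ R_gt^T."""
    c = (np.trace(R_est @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def success_rate(errors_deg, thresh_deg=10.0):
    """Percentage of trials whose rotation error falls below the threshold."""
    return 100.0 * np.mean(np.asarray(errors_deg) < thresh_deg)
```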
Comparisons.
To demonstrate the effectiveness of the proposed registration method, we compare it against four alternatives: 1) a baseline method that aligns principal axes extracted with weighted PCA [36], where the weight of each vertex is proportional to its local surface area; 2) Go-ICP [39], which combines local ICP with a branch-and-bound search to find the global minimum; 3) FPFH [33, 53], which matches FPFH descriptors; and 4) 4PCS, a state-of-the-art method that performs global registration by constructing a congruent set of 4 points between range images [4]. We do not compare with its latest variant Super4PCS [40], as the latter improves only efficiency. For Go-ICP, FPFH and 4PCS, we use the authors' original implementations and tune parameters to achieve optimal performance.
Figure 6 compares the performance of the five methods on the two data sets. The overall performance on the Princeton Shape Benchmark is lower, as this data set is more challenging, with many symmetric objects. As expected, the baseline PCA method only works well when there is sufficient overlap. All previous methods experience a dramatic fall in accuracy once the overlap amount becomes small; 4PCS performs best among them, but because 4PCS essentially searches for the most consistent area shared by two shapes, at small overlap ratios it can converge to false alignments (Figure 11). Our method outperforms all previous approaches, and its performance does not degrade until the overlap becomes very small. The average performance is summarized in Table 1.
Method         PCA    Go-ICP  FPFH  4PCS  Our Method
Stanford (%)   19.5   34.1    49.3  73.0  93.6
Princeton (%)  18.5   22.0    33.0  73.2  81.5
Runtime (sec)  0.01   25      3     10    0.5
Performance on Real Data.
We further compare the performance of our registration method with 4PCS on pairs of depth maps captured with Kinect One and Structure IO sensors. The hardware setup used to obtain this data is described in detail in the next section. These depth maps share only 10%–30% overlap, and 4PCS often fails to compute the correct alignment, as shown in Figure 8.
Limitations.
Our global registration method, like most other methods, fails to align scans with dominant symmetries, since in such cases depth alone is not enough to resolve the ambiguity. This limitation also holds for scans depicting large planar surfaces (e.g., walls and the ground) due to their continuous symmetry.
6 Dynamic Capture Results
Hardware. We provide results from our dynamic scene capture system. We experiment with two popular depth sensors, namely the Kinect One (V2) sensor and the Structure IO sensor. We mount the former on laptops and extend the capture range with long power extension cables. The latter we attach to iPad Air 2 tablets, streaming data to laptops over a wireless network. Kinect One sensors stream high-fidelity 512×424 depth images and 1920×1080 color images at 30 fps; we use them to cover the entire human body from 3 or 4 views at approximately 2 meters away. Structure IO sensors stream 640×480 images for both depth and color (the iPad RGB camera after compression) at 30 fps. The per-pixel depth accuracy of the Structure IO sensor is relatively low and unreliable, especially when used outdoors beyond 2 meters; we therefore use it to capture small subjects, e.g., dogs and children, at approximately 1 meter away. Our mobile capture setting allows the subject to move freely in space instead of being restricted to a specific capture volume.
Preprocessing. For each depth image, we first remove the background by thresholding depth values and removing dominant planar segments in a RANSAC fashion. For temporal synchronization across depth sensors, we use visual cues, i.e., jumping and clapping hands, to manually identify a common starting frame. We then automatically synchronize all remaining frames using the system time stamp of each frame, which is accurate to within milliseconds.
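A minimal version of the RANSAC-style plane removal might look as follows (a sketch under assumed iteration count and inlier threshold, not the preprocessing code used in the paper):

```python
import numpy as np

def remove_dominant_plane(points, n_iters=200, inlier_dist=0.02, rng=None):
    """Find the plane supported by the most points and discard its inliers.

    Repeatedly fits a plane through 3 random points; the winning plane's
    inliers (e.g. the floor) are removed from the cloud.
    """
    rng = rng or np.random.default_rng(0)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        a, b, c = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(b - a, c - a)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:          # degenerate (collinear) sample, skip
            continue
        normal /= norm
        mask = np.abs((points - a) @ normal) < inlier_dist
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return points[~best_mask]
```

In practice this would be applied after the depth threshold, possibly several times to strip walls as well as the floor.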
Performance. We process data on a single thread of an Intel Core i7-4710MQ CPU clocked at 2.5 GHz. On average, it takes 15 seconds to globally align all the views for each frame, 5 minutes for surface denoising and reconstruction, and 3 minutes for building dense correspondences and reconstructing texture.
Results. We capture a variety of motions and objects, including walking, jumping, playing Tai Chi and dog training (see the supplementary material for a complete list). For all captures, the performer(s) are able to move freely in space while 3 or 4 people follow them with depth sensors. As shown in Figure 9, our geometry reconstruction method reduces flickering artifacts of the original Poisson reconstruction, and our texture reconstruction method recovers reliable texture on occluded areas. Figure 10 provides several examples that demonstrate the effectiveness and flexibility of our capture system. Our global registration method plays a key role as most range images share only 10% to 30% overlap. While we demonstrate successful sequences with 3 depth sensors, using an additional sensor typically improves the reconstruction quality since it provides higher overlap between neighboring views leading to a more robust registration.
As opposed to most existing free-form surface reconstruction techniques, our method can handle performances of subjects that move through a long trajectory instead of being constrained to a capture volume. Since our method does not require a template, it is not restricted to human performances, and can successfully capture animals, for which obtaining a static template would be challenging. The global registration performed for each frame effectively reduces drift over long capture sequences. We can recover plausible textures even in regions that are never fully captured by the sensors, using textures from frames where they are visible.
7 Conclusion
We have demonstrated that it is possible, using only a small number of synchronized consumer-grade handheld sensors, to reconstruct fully-textured moving humans, without restricting the subject to the constrained environment required by stage setups with calibrated sensor arrays. Our system does not require a template geometry in advance and thus generalizes well to a variety of subjects, including animals and small children. Since our system is based on low-cost devices and works in fully unconstrained environments, we believe it is an important step toward accessible creation of VR and AR content for consumers. Our results depend critically on our new alignment algorithm based on the visibility error metric, which can reliably align partial scans with much less overlap than is required by current state-of-the-art registration algorithms. Without this alignment algorithm, we would need to use many more sensors, and solve the sensor interference problem that would arise. We believe this algorithm is an important contribution on its own, as it represents a significant step forward in global registration.
References
 [1] Collet, A., Chuang, M., Sweeney, P., Gillett, D., Evseev, D., Calabrese, D., Hoppe, H., Kirk, A., Sullivan, S.: Highquality streamable freeviewpoint video. In: ACM SIGGRAPH. Volume 34., ACM (July 2015) 69:1–69:13
 [2] Ye, G., Deng, Y., Hasler, N., Ji, X., Dai, Q., Theobalt, C.: Freeviewpoint video of human actors using multiple handheld kinects. IEEE Transactions on Cybernetics 43(5) (2013) 1370–1382
 [3] Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: Shape completion and animation of people. ACM Trans. Graph. 24(3) (July 2005) 408–416
 [4] Aiger, D., Mitra, N.J., Cohen-Or, D.: 4-points congruent sets for robust pairwise surface registration. In: ACM Transactions on Graphics (TOG). Volume 27., ACM (2008) 85
 [5] Debevec, P.: The Light Stages and Their Applications to Photoreal Digital Actors. In: SIGGRAPH Asia, Singapore (November 2012)
 [6] Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., Fitzgibbon, A.: Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In: UIST, New York, NY, USA, ACM (2011) 559–568
 [7] Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H.: Scanning 3d full human bodies using kinects. IEEE TVCG 18(4) (April 2012) 643–650
 [8] Li, H., Vouga, E., Gudym, A., Luo, L., Barron, J.T., Gusev, G.: 3d self-portraits. In: ACM SIGGRAPH Asia. Volume 32., ACM (November 2013) 187:1–187:9
 [9] Newcombe, R.A., Fox, D., Seitz, S.M.: Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In: IEEE CVPR. (June 2015)
 [10] Dou, M., Taylor, J., Fuchs, H., Fitzgibbon, A., Izadi, S.: 3d scanning deformable objects with a single rgbd sensor. In: IEEE CVPR. (June 2015) 493–501
 [11] Li, H., Yu, J., Ye, Y., Bregler, C.: Real-time facial animation with on-the-fly correctives. In: ACM SIGGRAPH. Volume 32., ACM (July 2013) 42:1–42:10
 [12] Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Real-time and robust hand tracking from depth. In: IEEE CVPR, IEEE (2014) 1106–1113
 [13] Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: IEEE CVPR, IEEE (2012) 1862–1869
 [14] Bogo, F., Black, M.J., Loper, M., Romero, J.: Detailed full-body reconstructions of moving people from monocular RGBD sequences. In: IEEE ICCV. (December 2015) 2300–2308
 [15] Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. In: ACM SIGGRAPH. SIGGRAPH ’08, New York, NY, USA, ACM (2008) 97:1–97:9
 [16] de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. In: ACM SIGGRAPH, New York, NY, USA, ACM (2008) 98:1–98:10
 [17] Li, H., Adams, B., Guibas, L.J., Pauly, M.: Robust single-view geometry and motion reconstruction. In: ACM SIGGRAPH Asia. SIGGRAPH Asia ’09, New York, NY, USA, ACM (2009) 175:1–175:10
 [18] Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., Stamminger, M.: Real-time non-rigid reconstruction using an rgbd camera. In: ACM SIGGRAPH. Volume 33., New York, NY, USA, ACM (July 2014) 156:1–156:12
 [19] Wu, C., Stoll, C., Valgaerts, L., Theobalt, C.: On-set performance capture of multiple actors with a stereo camera. ACM Trans. Graph. 32(6) (November 2013) 161:1–161:11
 [20] Liu, Y., Ye, G., Wang, Y., Dai, Q., Theobalt, C.: Human performance capture using multiple handheld kinects. In: Computer Vision and Machine Learning with RGB-D Sensors. Springer International Publishing, Cham (2014) 91–108
 [21] Wand, M., Jenke, P., Huang, Q., Bokeloh, M., Guibas, L., Schilling, A.: Reconstruction of deforming geometry from time-varying point clouds. In: SGP. SGP ’07 (2007) 49–58
 [22] Wand, M., Adams, B., Ovsjanikov, M., Berner, A., Bokeloh, M., Jenke, P., Guibas, L., Seidel, H.P., Schilling, A.: Efficient reconstruction of non-rigid shape and motion from real-time 3d scanner data. ACM TOG 28(2) (May 2009) 15:1–15:15
 [23] Süßmuth, J., Winter, M., Greiner, G.: Reconstructing animated meshes from time-varying point clouds. In: SGP. SGP ’08 (2008) 1469–1476
 [24] Tevs, A., Berner, A., Wand, M., Ihrke, I., Bokeloh, M., Kerber, J., Seidel, H.P.: Animation cartography—intrinsic reconstruction of shape and motion. ACM TOG 31(2) (April 2012) 12:1–12:15
 [25] Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. In: ACM SIGGRAPH Asia. SIGGRAPH Asia ’09 (2009) 174:1–174:11
 [26] Li, H., Luo, L., Vlasic, D., Peers, P., Popović, J., Pauly, M., Rusinkiewicz, S.: Temporally coherent completion of dynamic shapes. ACM TOG 31(1) (February 2012) 2:1–2:11
 [27] BojsenHansen, M., Li, H., Wojtan, C.: Tracking surfaces with evolving topology. ACM Transactions on Graphics (SIGGRAPH 2012) 31(4) (2012) 53:1–53:10
 [28] Zhang, Z.: Iterative point matching for registration of free-form curves and surfaces. IJCV 13(2) (1994) 119–152
 [29] Chen, Y., Medioni, G.: Object modeling by registration of multiple range images. In: ICRA, IEEE (1991) 2724–2729
 [30] Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on 21(5) (1999) 433–449
 [31] Gelfand, N., Mitra, N.J., Guibas, L.J., Pottmann, H.: Robust global registration. In: Symposium on geometry processing. Volume 2. (2005) 5
 [32] Rusu, R.B., Blodow, N., Marton, Z.C., Beetz, M.: Aligning point cloud views using persistent feature histograms. In: Intelligent Robots and Systems, 2008 IEEE/RSJ International Conference on, IEEE (2008) 3384–3391
 [33] Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: Robotics and Automation, 2009 IEEE International Conference on, IEEE (2009) 3212–3217
 [34] Makadia, A., Patterson, A., Daniilidis, K.: Fully automatic registration of 3d point clouds. In: CVPR, 2006 IEEE Conference on. Volume 1., IEEE (2006) 1297–1304
 [35] Horn, B.K.: Extended gaussian images. Proceedings of the IEEE 72(12) (1984) 1671–1686
 [36] Chung, D.H., Yun, I.D., Lee, S.U.: Registration of multiplerange views using the reversecalibration technique. Pattern Recognition 31(4) (1998) 457–464
 [37] Chen, C.S., Hung, Y.P., Cheng, J.B.: Ransac-based darces: A new approach to fast automatic registration of partially overlapping range images. Pattern Analysis and Machine Intelligence, IEEE Transactions on 21(11) (1999) 1229–1234
 [38] Silva, L., Bellon, O.R., Boyer, K.L.: Precision range image registration using a robust surface interpenetration measure and enhanced genetic algorithms. Pattern Analysis and Machine Intelligence, IEEE Transactions on 27(5) (2005) 762–776
 [39] Yang, J., Li, H., Jia, Y.: Go-ICP: Solving 3d registration efficiently and globally optimally. In: Computer Vision (ICCV), 2013 IEEE International Conference on, IEEE (2013) 1457–1464
 [40] Mellado, N., Aiger, D., Mitra, N.J.: Super 4pcs fast global pointcloud registration via smart indexing. In: Computer Graphics Forum. Volume 33., Wiley Online Library (2014) 205–215
 [41] Moezzi, S., Tai, L.C., Gerard, P.: Virtual view generation for 3d digital video. MultiMedia, IEEE 4(1) (Jan 1997) 18–26
 [42] Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’00, New York, NY, USA, ACM Press/Addison-Wesley Publishing Co. (2000) 369–374
 [43] Franco, J., Lapierre, M., Boyer, E.: Visual shapes of silhouette sets. In: 3D Data Processing, Visualization, and Transmission, Third International Symposium on. (June 2006) 397–404
 [44] Ahmed, N., Theobalt, C., Dobrev, P., Seidel, H.P., Thrun, S.: Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In: IEEE CVPR. (June 2008) 1–8
 [45] Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Comput. Graph. Appl. 27(3) (May 2007) 21–31
 [46] Wu, C., Varanasi, K., Liu, Y., Seidel, H.P., Theobalt, C.: Shading-based dynamic shape refinement from multi-view video under general illumination. In: IEEE ICCV, IEEE (November 2011) 1108–1115
 [47] Hernández, C., Schmitt, F., Cipolla, R.: Silhouette coherence for camera calibration under circular motion. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(2) (2007) 343–349
 [48] Wei, L., Huang, Q., Ceylan, D., Vouga, E., Li, H.: Dense human body correspondences using convolutional networks. In: IEEE CVPR, IEEE (2016)
 [49] Kennedy, J.: Particle swarm optimization. In: Encyclopedia of Machine Learning. Springer (2010) 760–766
 [50] Shoemake, K.: Uniform random rotations. In: Graphics Gems III, Academic Press Professional, Inc. (1992) 124–132
 [51] Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The princeton shape benchmark. In: Shape modeling applications, 2004. Proceedings, IEEE (2004) 167–178
 [52] Rusinkiewicz, S., Levoy, M.: Efficient variants of the icp algorithm. In: 3D Digital Imaging and Modeling, IEEE (2001) 145–152
 [53] Rusu, R.B., Cousins, S.: 3d is here: Point cloud library (pcl). In: Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE (2011) 1–4
 [54] Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the fourth Eurographics symposium on Geometry processing. Volume 7. (2006)
 [55] Chuang, M., Luo, L., Brown, B.J., Rusinkiewicz, S., Kazhdan, M.: Estimating the Laplace-Beltrami operator by restricting 3d functions. In: Computer Graphics Forum. Volume 28., Wiley Online Library (2009) 1475–1484
 [56] Chuang, M., Luo, L., Brown, B.J., Rusinkiewicz, S., Kazhdan, M.: Estimating the Laplace-Beltrami operator by restricting 3D functions. Symposium on Geometry Processing (July 2009)
 [57] Zhou, Q.Y., Koltun, V.: Color map optimization for 3d reconstruction with consumer depth cameras. In: ACM SIGGRAPH. Volume 33., ACM (July 2014) 155:1–155:10
Supplemental to Section 4.2 – Particle Update Methods
This section discusses in detail the update methods for guide particles and regular particles during the particle swarm optimization.
Guide Particle Update.
Here we describe how we update guide particle $T_i^t$ ($i$ is the particle index and $t$ is the current frame number). We parameterize the tangent space of $SE(3)$ at $T_i^t$ by vectors $x = (\omega, u) \in \mathbb{R}^6$, with $T(x) = \begin{pmatrix} \exp(\hat{\omega}) & u \\ 0 & 1 \end{pmatrix} T_i^t$, where $\hat{\omega}$ is the cross-product matrix of $\omega$. For any fixed $x$ and pair of partial scans, we can separate the two scans into the regions described in the main text, and obtain the visibility error $E(x)$ summed over these regions.
Minimizing $E$ with respect to $x$ is then a nonlinear least-squares problem, which we solve using Levenberg-Marquardt. We begin with the initial guess $x_0 = 0$ and iteratively apply the quasi-Newton update
$$x_{m+1} = x_m + \delta_m, \qquad (3)$$
where $\delta_m$ is obtained from solving the linear system
$$(J^\top J + \lambda I)\,\delta_m = -J^\top r, \qquad (4)$$
$r$ is a stacked column vector of the per-region residuals, and $J$ is the Jacobian matrix of $r$ with respect to $x$, calculated using the chain rule. The damping factor $\lambda$ is set to 0.1 throughout all experiments. After $x_m$ converges, we set $T_i^t \leftarrow T(x_m)$.
Regular Particle Update.
Here we describe how we update regular (non-guide) particle $i$ at the current frame $t$. We parameterize its rotation $R$ by $q \in \mathbb{R}^3$, the imaginary part of the quaternion representation of $R$. Up to the sign of the real part, which is assumed positive, $q$ determines a unique unit quaternion representing the rotation $R$. $q$ is updated in a traditional PSO fashion,
$$v \leftarrow w\, v + c_1 (p_{\mathrm{best}} - q) + c_2 (l_{\mathrm{best}} - q), \qquad q \leftarrow q + v, \qquad (5)$$
where $v$ is the velocity from the previous iteration, $p_{\mathrm{best}}$ is the best location particle $i$ has visited, and $l_{\mathrm{best}}$ is the best particle location within radius $r$. Please refer to [49] for more details. The fixed weights $w$, $c_1$, and $c_2$ are set to 0.2, 0.3, and 0.3 throughout all experiments. After the update, the boundary condition $\|q\| \le 1$ is checked and enforced by normalization if violated.
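The regular-particle update above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the energy evaluation and neighborhood search that produce the personal-best and local-best locations are assumed to happen elsewhere, and only the velocity/position update with the stated weights (0.2, 0.3, 0.3) and the unit-ball boundary condition follows the text.

```python
import numpy as np

# Fixed weights from the paper: inertia 0.2, personal-best and
# neighborhood-best pulls 0.3 each.
W_INERTIA, C1, C2 = 0.2, 0.3, 0.3

def pso_step(q, v, personal_best, neighborhood_best):
    """One PSO update of a particle's rotation parameters q.

    q is the imaginary part of a unit quaternion, so it must satisfy
    ||q|| <= 1; we project back onto the unit ball if the update
    violates this boundary condition.
    """
    v_new = (W_INERTIA * v
             + C1 * (personal_best - q)
             + C2 * (neighborhood_best - q))
    q_new = q + v_new
    norm = np.linalg.norm(q_new)
    if norm > 1.0:
        q_new = q_new / norm  # enforce ||q|| <= 1 by normalization
    return q_new, v_new

def quaternion_from_q(q):
    """Recover the full unit quaternion (real part assumed non-negative)."""
    w = np.sqrt(max(0.0, 1.0 - float(q @ q)))
    return np.array([w, q[0], q[1], q[2]])
```

In a full optimizer this step would run per particle per iteration, with the personal and neighborhood bests refreshed from the alignment-energy evaluations of previous iterations.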
Supplemental to Section 3 – Surface Reconstruction Algorithm
This section summarizes the surface reconstruction method. After globally registering the partial scans of each frame, we perform Poisson surface reconstruction [54] to fuse the three or four partial scans $P_t^s$ ($t$ and $s$ denote the frame and the sensor number, respectively), obtaining a sequence of complete, watertight surfaces $S_t$. To reduce flickering artifacts and to fill holes, we adopt the shape completion pipeline of Li et al. [26] to warp partial scans from temporally-proximate frames to the current frame geometry. For each frame $t$, we warp $S_{t-1}$ and $S_{t+1}$ to align with $S_t$ using a mesh deformation model based on pairwise correspondences and Laplacian coordinates. We then combine them all using Poisson surface reconstruction with the following weights: 10 for the reconstructed mesh of the current frame and of the warped neighboring frames, 2 for the hole-filled regions of the current frame, and 1 for the hole-filled regions of the warped neighboring frames. This imposes a mild temporal filter on the reconstructed surfaces and a strong filter on the hole-filled regions. This step reduces temporal flicker and propagates some of the reconstructed surface detail from the neighboring frames onto the current frame (since the weight of neighboring reconstructed meshes is larger than that of any hole-filled region). Please refer to [26] for more details.
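The per-point confidence weights used in this temporal fusion can be sketched as below. The function name and the boolean masks are our own, and the Poisson solver itself [54] is abstracted away; only the 10/2/1 weighting scheme comes from the text.

```python
import numpy as np

# Confidence weights from the paper: strong trust in reconstructed
# geometry, weaker trust in hole-filled regions.
W_RECONSTRUCTED = 10.0  # scanned regions, current frame or warped neighbors
W_HOLE_CURRENT = 2.0    # hole-filled regions of the current frame
W_HOLE_NEIGHBOR = 1.0   # hole-filled regions of warped neighboring frames

def fusion_weights(is_hole_filled, is_neighbor):
    """Per-point confidence weights for the temporal Poisson fusion.

    is_hole_filled, is_neighbor: boolean arrays over all sample points
    (current frame plus the two warped neighboring frames).
    """
    w = np.full(is_hole_filled.shape, W_RECONSTRUCTED)
    w[is_hole_filled & ~is_neighbor] = W_HOLE_CURRENT
    w[is_hole_filled & is_neighbor] = W_HOLE_NEIGHBOR
    return w
```

These weights would then be passed per point to a confidence-weighted Poisson reconstruction, so the solver trusts scanned geometry far more than hole fill.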
Supplemental to Section 3 – Texture Reconstruction Algorithm
This section explains in detail the texture reconstruction step based on dense correspondences. After the surface reconstruction step, we first perform texture reconstruction [56] to obtain a texture for each surface $S_t$, by fusing and interpolating the texture from the partial scans $P_t^s$ ($t$ and $s$ denote the frame and the sensor number, respectively). However, each surface contains regions where this texture is unreliable, either because the region had poor coverage in the partial scans, or because it is located near the seam between two partial scans, where the texture is inconsistent due to sensor noise and variations in lighting. When capturing clothed humans using three sensors, we observe that roughly 10–20% of the texture on each surface is unreliable.
The recent work of Zhou et al. [57] presents impressive results on texturing scanned data. This method, however, assumes that the captured scene is static and is thus not applicable in our setting. Tracking methods like optical flow can be used to transfer texture between consecutive surfaces in our capture sequence, but we found them too fragile for our purposes: they fail if the deformation between frames is either too large (so that tracking fails) or too small (so that holes in coverage persist over large numbers of frames). Instead, we replace unreliable texture on each surface $S_t$ by computing dense correspondences between $S_t$ and other surfaces in the sequence (including temporally distant frames), and transferring texture from surfaces whose texture at the corresponding point is reliable. With this approach we can reconstruct reliable texture even in the presence of large geometry or topology changes over time.
Reliability Weight.
We first need a measure of how reliable the reconstructed texture is at each point $p$ of each surface $S_t$. Intuitively, texture is most reliable at points that directly face the camera; therefore, for each partial scan in which $p$ is visible, we set
$$w(p) = \langle n_p, v_p \rangle,$$
where $n_p$ is the surface normal at $p$ and $v_p$ is the view direction of the sensor that captured the scan. If $p$ is visible in multiple partial scans, we take the maximum weight, and if it is visible in none, we set $w(p) = 0$. Furthermore, we feather the weights of points that lie close to the boundary of any partial scan, as texture at the seams also tends to be unreliable.
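A minimal sketch of this reliability weight, under the assumption that the per-scan score is the clamped dot product between the surface normal and the direction toward the sensor (the exact clamping is not spelled out above), and omitting the seam feathering:

```python
import numpy as np

def reliability_weight(normal, view_dirs_visible):
    """Texture reliability at one surface point.

    normal: unit surface normal at the point, shape (3,).
    view_dirs_visible: unit directions from the point toward each
        sensor in which the point is visible, shape (k, 3); empty if
        the point is visible in no partial scan.

    Returns the maximum alignment between the normal and any view
    direction, clamped to be non-negative; 0 if never visible.
    """
    if len(view_dirs_visible) == 0:
        return 0.0  # not covered by any partial scan
    dots = view_dirs_visible @ normal  # cosine per visible sensor
    return float(max(0.0, dots.max()))  # best-facing sensor wins
```

In the full pipeline this weight would additionally be feathered toward zero near partial-scan boundaries.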
Computing Correspondences.
We adopt the method of Wei et al. [48] to predict a pose-invariant descriptor for every vertex of each surface $S_t$. The network of Wei et al. is trained on a large dataset of captured and artificial human depth images, and can reliably compute a 16-dimensional unit-length descriptor for every vertex, such that points that are nearby in feature space correspond to nearly the same location on the surfaces.
Texture Transfer.
We declare all points $p$ with $w(p)$ below a fixed threshold unreliable, and all others reliable; the same threshold is used throughout all experiments. We compute descriptors for all reliable points (across all frames) and place them in a KD-tree; for each unreliable point $p$, we find its 50 nearest neighbors (in feature space) among the reliable points, and take as the color of $p$ the weighted average of those neighbors' colors, with each neighbor weighted by its distance from $p$ in feature space and by its reliability weight $w$.
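The transfer step can be sketched with a standard KD-tree query. The inverse-distance weighting below is an assumption (the text only says neighbors are weighted by feature-space distance and by reliability), and in practice the tree would be built once over all reliable points across all frames rather than per query.

```python
import numpy as np
from scipy.spatial import cKDTree

K = 50  # nearest neighbors in descriptor space, as in the paper

def transfer_color(desc_query, desc_reliable, colors_reliable,
                   weights_reliable, k=K):
    """Color for one unreliable point: a weighted average of the colors
    of its k nearest reliable points in descriptor space.

    Each neighbor is weighted by its reliability weight and by a
    simple inverse feature-space distance (an assumed form).
    """
    tree = cKDTree(desc_reliable)
    k = min(k, len(desc_reliable))
    dists, idx = tree.query(desc_query, k=k)
    dists, idx = np.atleast_1d(dists), np.atleast_1d(idx)
    w = weights_reliable[idx] / (dists + 1e-8)  # reliability / distance
    w = w / w.sum()                             # normalize weights
    return w @ colors_reliable[idx]             # blended RGB color
```

The paper's descriptors are 16-dimensional unit vectors; the sketch works for any descriptor dimension.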
Supplemental to Section 5 – Qualitative Registration Results
Figure 11 below extends Figure 7 in the main text, showing additional global registration results.
Supplemental to Section 6 – List of Captured Sequences
This section lists statistics for all captured sequences in Table 2.
Sequence   | Sensor       | # Sensors | # Frames | # Points
Walking 1  | Kinect One   | 3         | 250      | 250,000
Jumping    | Kinect One   | 3         | 209      | 270,000
Kicking    | Kinect One   | 3         | 198      | 260,000
Tai Chi    | Kinect One   | 4         | 491      | 128,000
Swimming   | Kinect One   | 4         | 370      | 115,000
Walking 2  | Kinect One   | 4         | 201      | 160,000
Dog 1      | Kinect One   | 4         | 441      | 150,000
Dog 2      | Structure IO | 4         | 300      | 145,000
Supplemental to Section 6 – Limitations of Capture System
This section covers limitations of the proposed capture system. Global registration fails when there is almost no overlap between neighboring scans, i.e., when the overlap falls below a small fraction, which can happen when two neighboring sensors drift apart. Our method fails to capture fast motion, e.g., jumping, due to slight desynchronization across the different sensors (Figure 12). Because of the sparse views, some regions can remain consistently occluded, so their texture cannot be accurately recovered from other frames (Figure 12). Finally, in large occluded regions, Poisson reconstruction may fill in the missing surface with geometry far from the true human shape. In the future we wish to repair these regions by propagating details, using an approach similar to how we fix the texture.