MulayCap: Multi-layer Human Performance Capture Using A Monocular Video Camera

04/13/2020 ∙ by Zhaoqi Su, et al.

We introduce MulayCap, a novel human performance capture method using a monocular video camera without the need for pre-scanning. The method uses "multi-layer" representations for geometry reconstruction and texture rendering, respectively. For geometry reconstruction, we decompose the clothed human into multiple geometry layers, namely a body mesh layer and a garment piece layer. The key technique behind this is a Garment-from-Video (GfV) method for optimizing the garment shape and reconstructing the dynamic cloth to fit the input video sequence, based on a cloth simulation model which is effectively solved with gradient descent. For texture rendering, we decompose each input image frame into a shading layer and an albedo layer, and propose a method for fusing a fixed albedo map and solving for detailed garment geometry using the shading layer. Compared with existing single-view human performance capture systems, our "multi-layer" approach bypasses the tedious and time-consuming scanning step for obtaining a human-specific mesh template. Experimental results demonstrate that MulayCap produces realistic rendering of dynamically changing details that have not been achieved in any previous monocular video camera system. Benefiting from its fully semantic modeling, MulayCap can be applied to various important editing applications, such as cloth editing, re-targeting, relighting, and AR applications.







1 Introduction

Human performance capture aims to reconstruct a temporally coherent representation of a person's dynamically deforming surface (i.e., 4D reconstruction). Despite the rapid progress in 4D reconstruction using multiple RGB cameras or a single RGB-D camera, using a single monocular video camera for robust and accurate 4D reconstruction remains an ultimate goal, because it would provide a practical and convenient way of capturing human performance in general scenarios, thus enabling the adoption of human performance capture technology in various consumer applications, such as augmented reality, computer animation, holographic telepresence, biomechanics, virtual dressing, etc. However, this problem is highly challenging and ill-posed, due to fast motion, complex cloth appearance, non-rigid deformations, occlusions and the lack of depth information.

Due to these difficulties, there have been few attempts to use a single monocular RGB camera for human performance capture. The most recent works, [103] and [41], approach the problem by using a pre-scanned actor-specific template mesh, which requires extra labor and time to scan, making these methods hard to use in consumer applications or for human performance reconstruction from Internet videos. Moreover, these methods suffer from the limitation of using a single mesh surface to represent a human character, that is, the visible parts of the human skin and the dressed cloth are not separated. As a consequence, common cloth-body interactions, such as layering and sliding, are poorly tracked and represented. Furthermore, once obtained from the pre-scanned template mesh, the reconstructed texture is fixed to the mesh over all frames, resulting in unrealistic artifacts.

Without a pre-scanned model, human performance capture is a very difficult problem indeed, due to the need to resolve motion, geometry and appearance from the video frames simultaneously, without any prior knowledge about geometry and appearance. Regarding geometry, reconstruction of a free-form deformable surface from a single video is subject to ambiguity [34]. As for texture, it is hard to acquire a dynamic texture free of artifacts. Specifically, complex non-rigid motions introduce spatially and temporally varying shading on the surface texture. Directly updating the observed texture on the garment template to represent the motion may introduce serious stitching artifacts, even with ideal and precise geometry models. While artifact-free texture mapping can be obtained by scanning a static key model and then deforming it non-rigidly for temporal reconstruction, the resultant appearance tends to be static and unnatural.

In this paper, we propose MulayCap, a multi-layer human performance capture approach using a monocular RGB video camera that achieves dynamic geometry and texture rendering without the need for an actor-specific pre-scanned template mesh. Here the 'Mulay' notation means that "multi-layer" representations are proposed for reconstructing geometry and texture, respectively. We use a multi-layer representation in geometry reconstruction, which decomposes the clothed human into multiple geometric layers, namely a naked body mesh layer and a garment piece layer. In fact, two garment layers are used, one for the upper body clothing, such as a T-shirt, and the other for pants or trousers. The upper body clothing can also be generalized to include dresses, as shown in Fig. 1. To solve the garment modeling problem, we propose a Garment-from-Video (GfV) algorithm based on cloth simulation. Specifically, the garment shape parameters serve as parameters for cloth simulation and are optimized by minimizing the difference between the simulated cloth model and the dressed garment observed in the input video. During optimization, to avoid an exhaustive and inefficient search for the garment parameters, we use gradient descent minimization with a specified number of iterations. To further align the cloth simulation results with the input images, we apply a non-rigid deformation based on the shape and feature cues in each image. We demonstrate that our proposed garment simulation and optimization framework is capable of producing high quality and dynamic geometry details from a single video.

The multi-layer representation also works for dynamic texture reconstruction, in which the input video images are decomposed into albedo layers and shading layers for generating an albedo atlas with geometry details on clothing. Specifically, each input image is first decomposed into an albedo image and a shading image; the per-frame albedo is then fused on the reconstructed garments to create a static and shading-free texture layer. The albedo layer serves to maintain a temporally coherent texture basis. To obtain a realistic dynamic shape of the cloth, we use the shading image to solve for the garment geometry details with a shape-from-shading method. Finally, by compositing the detailed geometry, albedo and lighting information, we produce high quality and dynamic textured human performance rendering, which preserves the spatial and temporal coherence of the dynamic textures and the details of dynamic wrinkles on clothes.

In a nutshell, we present a novel template-free approach, called MulayCap, for human performance capture with a single RGB camera. The use of multi-layer representations enables a more semantic modeling of human performance, in which the body, garment pieces, albedo and shading are separately modeled and elaborately integrated to produce high quality, realistic results. This approach takes full advantage of high-level vision priors from existing computer vision research to yield high quality reconstruction from light-weight input. In contrast with existing human performance capture systems [57, 89, 77], our fully-semantic cloth and body reconstruction system facilitates more editing possibilities on the reconstructed human performances, such as relighting, body shape editing, cloth re-targeting, cloth appearance editing, etc., as will be shown later in the paper.

2 Related Work

While there are a large number of prominent works in human performance capture, we mainly review the works that are most related to our approach. We also summarize other related techniques including human shape and pose estimation as well as cloth simulation and capture.

Human Performance Capture

Human performance capture has been well studied for decades in computer vision and computer graphics. Most existing systems adopt generative optimization approaches, which can be roughly categorized into multiview-RGB-based methods and depth-based methods according to the capture setup. On the other hand, based on the representation of the captured subject, generative human performance capture methods can be classified into free-form methods, template-based deformation methods and parametric-model-based deformation methods.

For multiview-RGB-based human performance capture methods, earlier research focuses on free-form dynamic reconstruction. These methods use multiview video input by leveraging shape-from-silhouette [66, 68], multiview stereo [88, 61, 97] or photometric stereo [94] techniques. [21] performs video-realistic interactive character animation from a 4D database captured in a multiple camera studio. Benefiting from deep learning techniques, recent approaches try to minimize the number of cameras used (around 4) [36, 44]. Template-based deformation methods need pre-scanned templates of the subjects before motion tracking. They can generate topologically consistent and temporally coherent model sequences. Such methods take advantage of the relatively accurate pre-scanned human geometry prior and use non-rigid surface deformation [26, 20] or skeleton-based skinning techniques [93, 33, 62, 17, 99] to match the multi-view silhouettes and the stereo cues. There have been few studies focusing on temporally coherent shape and pose capture using monocular RGB video sequences. Existing works include [103] and [41], where a pre-scanned textured 3D model is a pre-requisite for both. In these methods, 3D joint positions are optimized based on CNN-based 2D and 3D joint detection results. Moreover, non-rigid surface deformation is incorporated to fit the silhouettes and photometric constraints for more accurate pose and surface deformation. In parametric-model-based deformation methods, the character-specific models used in the methods above are replaced by parametric body models like [4, 63, 84, 42, 76, 87] to eliminate the pre-scanning effort. However, parametric body models have limited power to describe the real-world detailed surface of the subject. Overall, as most template-based deformation methods regard the human surface as a single piece of watertight geometry, various free-form garment motions and garment-body interactions cannot be described by the surface deformation, which also acts as a key obstacle to high quality dynamic texture mapping.

Depth-based methods are relatively more efficient as 3D surface point clouds are provided directly. Many previous works in this field are free-form approaches, in which an incomplete template is gradually fused given continuous depth observations. Such free-form methods start from the fusion of a general dynamic scene [71], and have been improved by considering texture constraints [45, 39] and resolving topology changes [86, 85]. Multiple depth sensor based fusion approaches [24, 73, 29, 28] have been developed to improve robustness and accuracy by registering multi-view depth streams. Besides free-form fusion based methods, performance capture using template-based deformation is also a well studied area. [58, 38, 106, 117] leverage pre-scanned models to account for non-rigid surfaces, while in [43, 107, 10, 95] the performance capture problem is decomposed into articulated motion tracking and shape adaptation. [111] builds the BUFF dataset, which contains high quality clothed 3D scan sequences of humans, and estimates the human body shape and pose from these sequences. There are also some fusion-based approaches combining articulated templates or articulated priors for robust motion tracking and surface fusion [108, 56, 109, 102, 113].

Recently, benefiting from the success of deep learning, discriminative approaches for single-image human shape and pose estimation have attracted a lot of research attention. They have demonstrated that it is possible to estimate the human shape and pose using only a single RGB image by taking advantage of parametric body models [4, 64]. [11] optimizes the body shape and pose parameters by minimizing the distance between the detected 2D joints from a CNN-based pose detector and the projected 3D joints of the SMPL model. Follow-up works extend this approach by predicting the 3D pose and shape parameters directly. [75] proposes a two-step deep learning framework, where the first step estimates key joints and silhouettes from input images, and the second step predicts the SMPL parameters. [72] estimates SMPL parameters through body part segmentation. [48] uses a 3D regression module to estimate SMPL parameters and weak-perspective camera parameters, and it incorporates an adversarial prior to discriminate unusual poses as well. [49] uses temporal information to estimate human poses in a video. [51] combines the structures of [48] and [11] for iterative optimization of the human model. [74] uses SMPL-X, a more expressive model that includes the human face and hands. Besides, there are also some deep learning approaches for estimating the whole human 3D model or frontal depth map from a single image without using parametric models [83, 9, 114, 116, 90].

Fig. 2: The pipeline of MulayCap. Given the input monocular RGB video, the clothed human reconstruction is achieved by geometry and texture reconstruction. We first estimate the human pose and shape, and reconstruct the garment-based cloth based on the human model, then apply non-rigid cloth deformation based on semantic cloth segmentation result. The second step is to use albedo and shading images decomposed from the input frames to obtain cloth texture, geometry details and lighting, which are then combined for realistic rendering of the dynamic cloth.

Cloth Simulation and Capture  The ultimate goal of cloth simulation and cloth capture is to generate realistic 3D cloth with its dynamics. Given a 3D cloth model with its physical material parameters, the task of cloth simulation is to simulate realistic cloth dynamics even under different kinds of cloth-object interactions. Classical force-based cloth simulation methods are derived from continuum mechanics [92]; the underlying model can be a mass-spring system [81, 23, 60] or a more physically consistent model generated by the finite element method [12, 47]. These methods need to perform numerical time integration for simulating cloth dynamics, ranging from the more straightforward explicit Euler method [79] to more stable implicit or semi-implicit Euler methods [92, 5, 14]. Force-based cloth simulation methods can generate very realistic cloth dynamics, benefiting from the physically consistent models. Note that in MulayCap, cloth simulation is especially useful for dressing the naked body and generating plausible cloth dynamics when only 2D parametric cloth pieces and a monocular color video are available. Since highly accurate material modeling is not a requirement of our system, we use the method in [81] for simplicity and efficiency.

Different from cloth simulation, the cloth capture methods mainly concentrate on another problem: how to digitize the real world 3D model and even the real world dynamics of the cloth. Among active methods, [98] custom designs the cloth with specific color patterns and [16] uses custom designed active multi-spectral lighting for accurate cloth geometry and even material capture. However, the active methods are much more complex and may not generalize to off-the-shelf clothes. The passive methods are much more popular and have been developed using different kinds of input: multi-view RGB [15, 78, 89], 4D sequences [77, 54], RGBD [22, 110] or even single RGB [115, 25, 104, 2, 3, 40, 69]. Among these passive methods, [15, 78] focus on reconstructing real cloth geometries and even cloth wrinkle details using temporally coherent multi-view stereo algorithms and data driven approaches. [89] utilizes multi-view reconstructed 4D human performances to reconstruct a physically-animatable dressed avatar. [110] uses a single RGBD camera to accomplish multi-layer human performance capture, benefiting from a physics-based performance capture procedure. Given a high quality 4D sequence, [77] semantically digitizes the whole sequence and generates temporally coherent multi-layer meshes of both the human body and the cloth, while [54] learns a cloth specific high frequency wrinkle model based on normal mapping and uses the model to wrinkle the cloth under non-captured poses. [2] learns to reconstruct people in clothing with high accuracy, using a monocular video of a moving person as input. [40] builds a real-time human performance capture system, which uses RGB video as input and can reconstruct space-time coherent deforming geometry of an entire human. [3] learns geometry details from the texture map, and can infer a full body shape including cloth wrinkles, face and hair from a single RGB image.
[69] uses the silhouette information of a single RGB image to infer a textured 3D human model using deep generative models. [22] is a data driven approach: it first reconstructs a static textured model of the subject using an RGBD sequence and then performs cloth parsing based on a pre-designed cloth database for static but semantic cloth capture; [115] improves on [22] by using only a single image. [25] learns a specific cloth model from a large number of cloth simulation results with different bodies and poses, and uses it to infer cloth geometry directly from a single color image. Benefiting from parametric cloth models, [104] can reconstruct both the body shape and physically plausible cloth from a single image. However, such methods mainly focus on static cloth reconstruction, so cloth dynamics can only be generated by simulation, while our method can generate realistic dynamic cloth appearance given a video sequence. To conclude, on one hand, the data driven approaches above need either (captured or simulated) high quality 4D sequences or self-designed cloth databases as input, which are hard to obtain; moreover, the generalization of such approaches remains challenging. On the other hand, current direct cloth capture approaches still need carefully designed setups or heavy multi-view capture systems for high quality cloth capture. In our system, we use a data-driven approach to reconstruct static cloth models and propose a new direct cloth capture approach for capturing realistic dynamic cloth appearance from monocular video footage. There are also many interesting applications of cloth simulation and capture. For example, [96] proposes a learning method which compiles 2D sketches, a parametric cloth model and a parametric body model into a shared space for interactive garment design, generation and fitting.
One interesting work related to ours is [82], which mainly focuses on garment replacement (but not capture) given monocular video footage and relies on manual intervention, whereas our method is fully automatic and produces both realistic cloth capture and cloth replacement results.

Intrinsic Decomposition  The objective of intrinsic decomposition is to decompose a raw image into the product of its reflectance and shading. Because the decomposition of raw images is insufficiently constrained, optimization based methods often tackle the problem by carefully designed priors [6, 112, 8, 35], while deep-learning based methods incorporate learning from ground truth decomposition of raw images [70, 46, 59]. There are also sequences based solutions, using propagation methods [105, 13, 67], or considering reflectance as static while shading changes over time [65, 52]. Methods proposed in [53, 30] leverage multi-view inputs to recover the scene geometry and estimate the environment lighting.

Intrinsic decomposition has been widely applied for identifying the true colors of objects and analyzing their interactions with lights in the scene. Researchers have presented various applications based on the progress in this field, such as material editing by shading modification and recolorization by defining a transfer between the original and the target reflectance [105]. Since most wrinkles and folds on the cloth mainly contribute to the shading component of the input frames, intrinsic decomposition can be directly incorporated into our system to recover such details.

3 Method Overview

The main pipeline of MulayCap consists of two modules, i.e., multi-layer based geometry reconstruction (see Sect. 4) and multi-layer based texture rendering (see Sect. 5), as shown in Fig. 2. For the geometry module, we reconstruct a clothed human model for each input video frame. Each target clothed model contains separate geometry mesh layers for the individual garments and the SMPL human body mesh model [64]. We first detect and optimize the human shape and pose parameters of the SMPL model to get the body layer (see Sect. 4.1). We then select the 2D garment patterns and automatically dress the temporal body models using available cloth simulation methods [80] (see Sect. 4.2.1). After that, since the garments may not fit the input images, we optimize the 2D garment shape parameters using the 2D segmented garment pieces obtained from the video frames by instance human parsing methods like [37] (see Sect. 4.2.2). We name this cloth-simulation-based garment shape optimization method GfV. To further align the boundary in each frame, we refine the garments by non-rigid deformation based on the silhouette information in each input image (see Sect. 4.2.3).

For the texture module, to achieve temporally dynamic and artifact-free texture updating, we composite a static albedo layer and a constantly updated geometry detail layer on the 3D garments. The garment albedo layer represents a clean and shadow-free texture, while the geometry detail layer describes the dynamically changing wrinkles and shadows. First, based on the clothed model sequence obtained in the geometry module, we leverage the intrinsic decomposition method in [70] to decompose the input cloth images into albedo images and shading images. Multiple albedo images are then stitched and optimized on the 3D garment to form a static albedo layer (see Sect. 5.1). For the geometry detail, we further decompose the shading images into environment lighting and surface normal images (see Sect. 5.2). The normal images are then used to solve for surface details on the 3D garments (see Sect. 5.3). In this way, by combining the surface albedo, surface detail and environment lighting in rendering, we achieve realistic cloth rendering with temporally varying wrinkle details, without the side effect of stitching texture artifacts.

4 Multi-Layer Geometry

To elaborate on the geometry reconstruction in MulayCap, we first describe the reconstruction of body meshes, followed by the dressing-on and optimization of the garment layers.

4.1 Body Estimation

We use SMPL [64] to track the human shape and pose in each frame. Specifically, we first use the HMMR method [50] to estimate initial pose parameters $\theta_t$ and shape parameters $\beta_t$ for each frame $t$. All the $\beta_t$ are averaged to obtain a consistent SMPL shape for the whole sequence. We apply temporal smoothing to the pose parameters of adjacent frames to alleviate errors and jitter effects, and replace poses with drastic changes by an interpolation of their adjacent poses. We also leverage the 2D human joints detected by OpenPose [19] to further fix inaccurate poses from HMMR [50] and to estimate the global translation of the human model, by constraining the 2D distance between the projected 3D joint positions and the detected ones.
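The temporal smoothing and outlier replacement described above can be sketched as follows; the jump threshold, the filter weights and the flat per-frame parameter layout are illustrative assumptions, not values from the paper:

```python
import numpy as np

def smooth_poses(poses, jump_thresh=0.5):
    """Temporally smooth per-frame pose parameters (T x P array).

    Frames whose pose changes drastically w.r.t. both neighbours are
    treated as detection errors and replaced by linear interpolation of
    the adjacent frames; the rest get a small moving-average filter.
    Threshold and weights are illustrative, not the paper's values.
    """
    poses = np.asarray(poses, dtype=float).copy()
    T = len(poses)
    # Replace outlier frames by interpolating their neighbours.
    for t in range(1, T - 1):
        d_prev = np.linalg.norm(poses[t] - poses[t - 1])
        d_next = np.linalg.norm(poses[t] - poses[t + 1])
        if d_prev > jump_thresh and d_next > jump_thresh:
            poses[t] = 0.5 * (poses[t - 1] + poses[t + 1])
    # Light moving-average smoothing over adjacent frames.
    out = poses.copy()
    for t in range(1, T - 1):
        out[t] = 0.25 * poses[t - 1] + 0.5 * poses[t] + 0.25 * poses[t + 1]
    return out
```

A linear pose trajectory passes through the filter unchanged, while an isolated jump is pulled back to its neighbours.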

Fig. 3: 2D garment patterns for generating different clothes. (a) The 2D garment pattern for the upper cloth. (b) The 2D garment pattern for the pants. The red arrows and green arrows indicate the parameters controlling the width and height of the clothes, respectively.
Fig. 4: Garment dressing on the T-posed body. (a) The initial state of the dressing process, defined by the initial garment parameters. The red arrows indicate the attractive forces defined on the unseamed garment vertices. (b) The process of seaming the garment. (c) The cloth is seamed and worn on the body, simulated with gravity and body collision. (d) Resolving garment-garment collision. (e) Simulation result on one of the input frames based on the initial garment parameters. (f) Result after the garment parameter optimization under the same pose as (e).

4.2 Garment from Video (GfV)

4.2.1 Dressing

The cloth dressing task consists of two steps: a) garment pattern initialization; b) physical simulation. For reconstructing garment layers, multiple 2D template garment patterns are used for different initial 3D garment meshes. Two layers of garments are used for the upper body and the lower body, respectively, as shown in Fig. 3. Each pattern is composed of a front piece and a back piece. The sizes of the garment pieces are specified automatically according to the estimated body shape.

To drape the template garments onto a human body mesh in the standard T pose, we introduce external forces to stitch the front and back pieces using physics-based simulation, as shown in Fig. 4(a)-(e). The details of cloth dressing and simulation are described below.

For the sake of efficiency, we use the classical Force-Based Mass-Spring method of [80] for physics-based simulation, which treats the cloth mesh vertices as particles connected by springs. During simulation, the external forces applied on each garment vertex include gravity and the friction between the human body and the cloth. Collision constraints are added between the cloth vertices and the human model, and between overlapping cloth pieces, such as the T-shirt and the pants, to avoid penetration.
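A minimal sketch of one force-based mass-spring step in the spirit of the simulator above; the spring stiffness, particle mass, time step and explicit-Euler integration are illustrative assumptions (the actual system also handles friction, collisions and damping):

```python
import numpy as np

def mass_spring_step(x, v, springs, rest_len, k=500.0, mass=0.01,
                     gravity=(0.0, -9.8, 0.0), dt=1e-3):
    """One explicit-Euler step of a force-based mass-spring cloth.

    x, v     : (N,3) particle positions / velocities
    springs  : (S,2) index pairs of connected particles
    rest_len : (S,)  spring rest lengths
    """
    f = np.tile(np.asarray(gravity) * mass, (len(x), 1))  # gravity on every particle
    i, j = springs[:, 0], springs[:, 1]
    d = x[j] - x[i]
    L = np.linalg.norm(d, axis=1, keepdims=True)
    # Hooke's law along each spring (guard against zero-length springs).
    fs = k * (L - rest_len[:, None]) * d / np.maximum(L, 1e-9)
    np.add.at(f, i, fs)        # spring pulls particle i toward j
    np.add.at(f, j, -fs)       # and j toward i
    v = v + dt * f / mass
    x = x + dt * v
    return x, v
```

With a single stretched spring, one step pulls the two endpoints toward each other, as expected.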

The goal of the dressing step is to seam the two 2D pieces and stitch them into a complete 3D garment. We use the mean parameters of our parametric cloth model for the dressing step; the parameters will be further refined as described in Sect. 4.2.2. As shown in Fig. 4, to put different garment elements together on the T-posed human body, we apply attractive forces on particle pairs at the cutting edges. After about 300 rounds of simulation, a seam detection algorithm is performed to check whether each unseamed particle pair has been seamed successfully. If not, the initial garment parameters cannot fit the human body and will be automatically re-adjusted. For instance, if the upper cloth cannot be seamed at the waist, the algorithm determines that the cloth is too tight for the human model, and the corresponding garment parameters will be enlarged progressively until the cloth is successfully seamed.
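The seam-detection and re-adjustment logic can be sketched as follows; the gap tolerance, the widening increment and the flat parameter vector are hypothetical choices for illustration:

```python
import numpy as np

def check_and_adjust_seams(positions, seam_pairs, params,
                           seam_tol=0.01, widen=0.05):
    """Test whether every seam particle pair has closed after the
    simulation rounds; if any pair is still open, the garment is judged
    too tight and the shape parameters are enlarged for a retry.

    positions  : (N,3) cloth particle positions after simulation
    seam_pairs : (S,2) indices of particle pairs that should be seamed
    params     : garment shape parameters (hypothetical flat vector)
    """
    gaps = np.linalg.norm(positions[seam_pairs[:, 0]] -
                          positions[seam_pairs[:, 1]], axis=1)
    if np.all(gaps < seam_tol):
        return params, True            # all seams closed
    return params + widen, False       # too tight: widen and re-simulate
```

The caller would loop: simulate ~300 rounds, call this check, and repeat with the widened parameters until it returns success.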

4.2.2 Garment Parameter Estimation

After the dressing step, we estimate the garment parameters with several simulation passes. In each pass, the cloth simulation is performed on the whole sequence according to the estimated body shape and poses from Sect. 4.1. The optimization is initialized with the garment geometry from the initial simulation pass.

The garment shape is optimized by minimizing the difference between the rendered simulation results and the input image frames. Here we use the cloth boundaries in all the image frames as the main constraints for fitting. Given a set of garment shape parameters $\mathbf{g}$, we simulate, render and measure the following energy function

$$E(\mathbf{g}) = E_{\mathrm{bnd}}(\mathbf{g}) + \lambda_{\mathrm{reg}} E_{\mathrm{reg}}(\mathbf{g}),$$

where $E_{\mathrm{reg}}(\mathbf{g}) = \|\mathbf{g} - \mathbf{g}^{\mathrm{prev}}\|^2$ serves to regularize the updates between iterations, and $E_{\mathrm{bnd}}$ is used to maximize the matching between the cloth boundary in the input images and the rendered cloth boundary. For a garment worn on the body, $E_{\mathrm{bnd}}$ is defined as

$$E_{\mathrm{bnd}}(\mathbf{g}) = \sum_{t \in \mathcal{T}} \sum_{\mathbf{p}} \big| S_t(\mathbf{p}) - \tilde{S}_t(\mathbf{g})(\mathbf{p}) \big| \, D_t(\mathbf{p}),$$

where $t \in \mathcal{T}$ is the frame index evenly sampled from the input frames, $S_t$ is the segmented cloth from the input image obtained with the garment instance parsing method [37], $\tilde{S}_t(\mathbf{g})$ is the simulated and rendered cloth silhouette using garment parameters $\mathbf{g}$, and $D_t$ represents the distance map of the silhouette boundary of image $t$, defined as

$$D_t(\mathbf{p}) = \min\!\big( d_t(\mathbf{p}) + \bar{d}_t(\mathbf{p}),\; \tau \big),$$

where $\tau$ is a threshold set as 50. Here $d_t$ and $\bar{d}_t$ are the distance transforms of the silhouette of image $t$ and of its inverse image, respectively.
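Assuming the distance map is built from standard Euclidean distance transforms of the silhouette and its inverse, one plausible reading of the definition can be sketched with SciPy; the exact construction in the paper may differ:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_distance_map(mask, tau=50.0):
    """Truncated distance-to-silhouette-boundary map.

    mask : boolean (H,W) silhouette image. Summing the distance
    transform of the mask and of its inverse gives, at every pixel,
    (approximately) the distance to the silhouette boundary; values
    are clamped at the threshold tau = 50 as in the text.
    """
    mask = np.asarray(mask, dtype=bool)
    d_in = distance_transform_edt(mask)    # distance inside the silhouette
    d_out = distance_transform_edt(~mask)  # distance outside it
    return np.minimum(d_in + d_out, tau)
```

Pixels on the boundary get small values, pixels far from the boundary get larger (clamped) values, so a misplaced rendered silhouette is penalized proportionally to how far it strays.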

Since the simulation and rendering involve complex cloth-body interaction, the cloth vertex positions cannot be directly formulated as a function of $\mathbf{g}$. To calculate the gradient of the energy term for the Gauss-Newton iteration, we use a numerical differentiation strategy: given the garment parameters $\mathbf{g}$, we add a small value $\delta$ to the $i$-th element, reset the cloth vertex-pair constraints based on the new garment parameters, and redo the simulation so as to evaluate the energy at the new parameters, $E(\mathbf{g} + \delta \mathbf{e}_i)$. The gradient used for the Gauss-Newton iteration is then calculated as:

$$\frac{\partial E}{\partial g_i} \approx \frac{E(\mathbf{g} + \delta \mathbf{e}_i) - E(\mathbf{g})}{\delta}.$$
Our tests show that the above simulation and optimization steps need about 25 iterations to converge to garment shape parameters that roughly fit the garment to the human body and the input images. Fig. 4(f) illustrates the garment shape optimization result compared with Fig. 4(e).
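The finite-difference optimization loop can be sketched as below; `energy` stands in for the full simulate-and-render pipeline (a black box here), and the perturbation size, step size and iteration count are illustrative:

```python
import numpy as np

def optimize_garment_params(g0, energy, delta=1e-2, step=0.5, iters=25):
    """Finite-difference descent over garment shape parameters.

    energy(g) is assumed to re-run the cloth simulation for the sampled
    frames and return the scalar fitting energy E(g). Each gradient
    component is obtained by perturbing one parameter and re-simulating.
    """
    g = np.asarray(g0, dtype=float).copy()
    for _ in range(iters):
        e0 = energy(g)
        grad = np.zeros_like(g)
        for i in range(len(g)):
            g_pert = g.copy()
            g_pert[i] += delta                 # perturb one parameter
            grad[i] = (energy(g_pert) - e0) / delta
        g -= step * grad                       # descent step
    return g
```

Each iteration costs one simulation per parameter plus one for the base point, which is why the parameter vector is kept small.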

4.2.3 Non-rigid Deformation Refinement

Note that the garment parameters solved in Sect. 4.2.2 provide only a rough estimation, and the physics-based simulation cannot describe the subtle movements of the cloth, such as wrinkles, which are critical to realistic appearance modeling. We therefore refine the garment geometry using a non-rigid deformation approach to model dynamic cloth details. We up-sample the low resolution cloth mesh used for physics-based simulation to match the pixel resolution for detailed non-rigid alignment and subsequent geometry refinement. Here, we use the garment boundary to determine the displacement $\mathbf{d}_i$ of each vertex $i$ in the high resolution garment mesh by minimizing the following energy function

$$E(\mathbf{d}) = E_{\mathrm{bnd}}(\mathbf{d}) + \lambda_{\mathrm{smooth}} E_{\mathrm{smooth}}(\mathbf{d}) + \lambda_{\mathrm{reg}} E_{\mathrm{reg}}(\mathbf{d}),$$

where $E_{\mathrm{bnd}}$ penalizes the distance between the projected boundary of the displaced garment mesh and the observed garment boundary in the image.
The smoothness term is used to regularize the difference between the displacements of neighboring mesh vertices:

$$E_{\mathrm{smooth}}(\mathbf{d}) = \sum_{(i,j) \in \mathcal{E}} \|\mathbf{d}_i - \mathbf{d}_j\|^2,$$

where $\mathcal{E}$ is the edge set of the high resolution garment mesh.
The regularization term, $E_{\mathrm{reg}}(\mathbf{d}) = \sum_i \|\mathbf{d}_i\|^2$, is defined analogously to constrain the displacement magnitudes.

The energy function above is minimized using the Gauss-Newton method. As each energy term is defined either on a single high resolution mesh vertex or between nearby vertices, the system matrix is sparse, so the conjugate gradient algorithm is used in each Gauss-Newton iteration. The improvement resulting from the non-rigid deformation refinement is shown in Fig. 5(a) and Fig. 5(b); see the zoom-in for detailed boundary overlays. The boundary overlay of the rendered mesh with optimization in Fig. 5(b) is more accurate than that in Fig. 5(a).
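A sketch of the sparse solve for a single displacement coordinate, assuming a quadratic boundary/data term on a few constrained vertices; the weights, the scalar (per-coordinate) formulation and the `targets` interface are illustrative simplifications of the Gauss-Newton system described above:

```python
import numpy as np
from scipy.sparse import lil_matrix, identity
from scipy.sparse.linalg import cg

def solve_displacements(n_verts, edges, targets, lam_s=1.0, lam_r=0.1):
    """Solve for per-vertex displacements (one coordinate shown).

    targets : dict {vertex_index: desired displacement} from boundary
    correspondences; the smoothness term couples neighbouring vertices
    and a small regularizer keeps displacements bounded. The quadratic
    energy yields a sparse normal-equation system solved with CG.
    """
    A = lil_matrix((n_verts, n_verts))
    b = np.zeros(n_verts)
    for v, t in targets.items():               # data (boundary) term
        A[v, v] += 1.0
        b[v] += t
    for i, j in edges:                         # smoothness: (d_i - d_j)^2
        A[i, i] += lam_s
        A[j, j] += lam_s
        A[i, j] -= lam_s
        A[j, i] -= lam_s
    A = A.tocsr() + lam_r * identity(n_verts)  # regularizer: ||d||^2
    d, info = cg(A, b)
    assert info == 0                           # CG converged
    return d
```

On a small chain with one constrained endpoint, the displacement decays smoothly away from the constraint, which is exactly the behaviour the smoothness term is meant to produce.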

Fig. 5: The results of garment deformation refinement. Both (a) and (b) show the reconstructed garment meshes overlay on the input image. Please zoom in to see the overlapping boundaries. (a) The result before deformation refinement. (b) The result after deformation refinement.

5 Texture and Geometry Detail Refinement

In this section we will consider mapping texture to the detailed garment surface reconstructed in the preceding step. Note that directly texturing and updating would introduce serious stitching artifacts due to the spatially and temporally varying shadings and shadows. To obtain artifact-free and dynamically changing surface texture, we decompose the texture into the shading layer and the albedo layer, with the former for geometry detail refinement and global lighting, and the latter for generating a static albedo map for the cloth.

Specifically, for each input image frame $I_t$, we use the CNN-based intrinsic decomposition method proposed in [70] to get a reflectance image $A_t$ and a shading image $S_t$. We then use $A_t$ for garment albedo map calculation, and $S_t$ for dynamic geometry detail refinement and lighting estimation. These three components (i.e., detailed garment geometry, albedo map and lighting) are then combined to produce realistic garment rendering.

5.1 Albedo Atlas Fusion

To generate a static albedo atlas for each 3D garment, we need to maintain a well optimized texture base to reduce stitching artifacts and keep the texturing spatially and temporally consistent. Specifically, for each garment, we build a texture UV coordinate domain according to its 2D garment design pattern, so that each vertex of a reconstructed garment mesh is assigned a UV coordinate.

Our albedo fusion algorithm creates the albedo atlas from the albedo images $A_t$ of evenly sampled key frames. As multiple albedo pixels on different albedo images may project to the same UV coordinate, a multi-image blending algorithm is needed to create a high quality albedo atlas. We resolve this by selecting an as-good-as-possible albedo pixel from the multiple images for each texel of the atlas. The selection strategy resembles multiview texture mapping schemes [31, 18], which select the camera that minimizes the angle between the surface normal direction and the vertex-to-camera direction. To mitigate mosaicing seams, we follow the MRF seam optimization method [55], which removes them without affecting fine details of the albedo. For areas unseen in the sequence, we inpaint [91] those areas to obtain a complete albedo atlas. Fig. 6 illustrates the albedo atlas fusion pipeline.
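The per-texel view selection can be sketched as follows; the data layout and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Among the frames in which a texel is visible, pick the one whose viewing
# ray is most parallel to the surface normal (largest cosine), mirroring the
# view-selection heuristic of multiview texture mapping.
def select_best_frame(normal, cam_positions, point, visible):
    normal = normal / np.linalg.norm(normal)
    best, best_cos = None, -1.0
    for f, cam in enumerate(cam_positions):
        if not visible[f]:
            continue
        view = cam - point
        view = view / np.linalg.norm(view)
        c = float(normal @ view)              # cosine of normal/view angle
        if c > best_cos:
            best, best_cos = f, c
    return best

# A surface point facing +z should prefer the camera straight above it.
cams = np.array([[0.0, 0.0, 2.0], [2.0, 0.0, 0.5], [0.0, 2.0, 0.5]])
best = select_best_frame(np.array([0.0, 0.0, 1.0]), cams, np.zeros(3),
                         [True, True, True])
```

A seam-optimization pass (as in [55]) would then run on top of this per-texel choice to keep neighboring texels from switching source frames arbitrarily.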

Fig. 6: The pipeline of generating the albedo atlas. We iteratively combine albedo maps from evenly sampled key-frames and inpaint the unseen areas.

5.2 Shading Decomposition

The goal of shading decomposition is to estimate the incident lighting and per-frame normal images from the shading image sequence $S_t$. The incident lighting is then used for shape-from-shading based geometry refinement. Following the shading based surface refinement approach in [100], we use spherical harmonics to optimize the lighting and the normal image by minimizing the energy function

$E_{light}(\mathbf{l}, \mathcal{N}) = \sum_{p} \Big( S_t(p) - \sum_{k=1}^{9} l_k H_k(\mathbf{n}_p) \Big)^2,$

which models the difference between the approximate shading given by the spherical harmonics basis $H_k$ and the observed shading, summed over all the pixels $p$ that belong to a garment in an image frame.
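Because this shading model is linear in the spherical harmonic coefficients, the lighting fit reduces to a linear least-squares problem. A minimal sketch with synthetic normals and noiseless shading (standard real second-order SH constants; all variable names are illustrative):

```python
import numpy as np

def sh_basis(n):
    """Nine real second-order SH basis functions at unit normal n."""
    x, y, z = n
    return np.array([
        0.282095,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

rng = np.random.default_rng(0)
true_l = rng.normal(size=9)                       # synthetic lighting coefficients
normals = rng.normal(size=(500, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
H = np.stack([sh_basis(n) for n in normals])      # 500 x 9 design matrix
shading = H @ true_l                              # noiseless shading observations
l_est, *_ = np.linalg.lstsq(H, shading, rcond=None)
```

In practice the normals come from the recovered garment surface and the shading from the decomposed images, but the structure of the solve is the same.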

We initialize the normal map from the recovered garment surface, which is a better initialization than a uniform normal map. To improve accuracy, we select multiple key frames to enrich the variance of surface normals, as done for the albedo atlas generation in Sect. 5.1, and estimate the lighting using least squares over all the pixels in these frames. Key frames are selected iteratively: if the pose difference between the current frame and all previously selected key frames exceeds a threshold, the current frame is added as a new key frame; the iteration stops when no new key frame can be added. To constrain the range of normal estimation, we then refine the normals under the estimated lighting by minimizing the energy

$E(\mathcal{N}) = E_{light} + \lambda_{reg} E_{reg} + \lambda_{lap} E_{lap} + \lambda_{norm} E_{norm}.$

The regularization term

$E_{reg} = \sum_p \|\mathbf{n}_p - \mathbf{n}_p^0\|^2,$

where $\mathbf{n}_p^0$ is the initial normal at pixel $p$, is used to constrain the updating step.

The Laplacian term

$E_{lap} = \sum_p \Big\| \mathbf{n}_p - \frac{1}{|\mathcal{N}(p)|} \sum_{q\in\mathcal{N}(p)} \mathbf{n}_q \Big\|^2$

is used to constrain the smoothness of the normal image, where $\mathcal{N}(p)$ is the set of $p$'s neighbor pixels.

The normalization term

$E_{norm} = \sum_p \big( \|\mathbf{n}_p\|^2 - 1 \big)^2$

is used to constrain the normal at every cloth pixel to be unit length.
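The iterative key-frame selection used for the lighting estimation can be sketched as follows; the pose representation, distance metric, and threshold are illustrative assumptions.

```python
import numpy as np

# A frame becomes a key frame when its pose differs from every already
# selected key frame by more than a threshold; iteration stops once no new
# key frame can be added (a single pass suffices for this greedy rule).
def select_key_frames(poses, threshold):
    keys = []
    for i, pose in enumerate(poses):
        if all(np.linalg.norm(pose - poses[k]) > threshold for k in keys):
            keys.append(i)
    return keys

# Toy 1-D "poses": frames 0, 2, 4 are mutually distant, 1 and 3 are redundant.
poses = np.array([[0.0], [0.1], [1.0], [1.05], [2.0]])
keys = select_key_frames(poses, 0.5)
```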

5.3 Geometry refinement using shape-from-shading

So far, we have obtained the incident lighting of the scene. Next, we use the shape-from-shading approach to refine the geometry details, using both the lighting and the shading image sequence, so as to represent wrinkles and folds for realistic rendering. To compute the per-vertex displacement of the cloth, we first formulate the normal of vertex $i$ as

$\mathbf{n}_i = \operatorname{normalize}\Big( \sum_{j\in\mathcal{N}(i)} (\mathbf{v}_j - \mathbf{v}_i) \times (\mathbf{v}_{j+1} - \mathbf{v}_i) \Big),$

where $\mathcal{N}(i)$ is the set of vertex $i$'s neighbors in clockwise order. Then we formulate the energy for the shape-from-shading based geometry refinement as

$E(\mathbf{D}) = \sum_p \Big( S_t(p) - \sum_{k=1}^{9} l_k H_k(\mathbf{n}_p(\mathbf{D})) \Big)^2 + \lambda_{smooth} E_{smooth}(\mathbf{D}) + \lambda_{reg} E_{reg}(\mathbf{D}),$

where $\mathbf{D}$ represents the displacements of all the vertices, $E_{smooth}$ is the same smoothness term as in Sect. 4.2.3, and $E_{reg}$ constrains the updating magnitude.
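The per-vertex normal formulation can be sketched as follows, assuming the one-ring neighbors are given as a closed loop in a consistent winding order (function and variable names are illustrative):

```python
import numpy as np

def vertex_normal(v_i, ring):
    """Normal of vertex v_i: sum of cross products of edges to consecutive
    one-ring neighbors (ring is a closed loop in consistent winding order)."""
    n = np.zeros(3)
    for j in range(len(ring)):
        e0 = ring[j] - v_i
        e1 = ring[(j + 1) % len(ring)] - v_i
        n += np.cross(e0, e1)
    norm = np.linalg.norm(n)
    return n / norm if norm > 0 else n

# A flat square ring around the origin in the xy-plane yields a unit +/-z normal.
ring = [np.array(p, dtype=float)
        for p in [(1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]]
n = vertex_normal(np.zeros(3), ring)
```

Because each vertex normal is a differentiable function of the neighboring vertex positions, the shading energy can be differentiated with respect to the per-vertex displacements during the refinement.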

Now the remaining issue is how to estimate the displacement of invisible vertices. Note that the motion of each vertex can be decomposed into two parts, namely, the global garment motion from body-garment simulation (see Sect. 4.2.3) and the local details that cannot be described by simulation. For the invisible vertices, we assume that their local details remain unchanged. Specifically, we compute the global rotation of each invisible vertex relative to its last visible frame from the simulated normal orientation, and transform the vertex displacement by this global rotation. For each frame, we finally apply a spatial smoothing filter over the boundaries between visible and invisible regions to mitigate spatially inconsistent seam artifacts.

Finally, given a camera model, as the incident lighting, surface albedo, and dynamic geometry details have all been obtained, we render realistic clothed human performances using the spherical harmonics rendering model [7].

Fig. 7: Some reconstruction results on our test sequences. Each pair shows the original image on the left and our result on the right.
Fig. 8: Free-viewpoint rendering of different human models. From left to right: input images, reconstructed models from the captured views, and the reconstructed models in two virtual views.

6 Experiments

In our experiments, we use monocular RGB videos, from both the internet and our own cameras, containing casual human motions such as walking, playing soccer, giving a speech, exercising, and dancing. The clothing includes pants, trousers, long-sleeve/short-sleeve T-shirts and shirts, which are represented by our designated 2D garment patterns.

Besides human instance parsing and intrinsic decomposition, the main pipeline takes around 5 hours to process a sequence of 300 frames on a 3.4GHz Intel Xeon E3-1231 processor. Specifically, pose and shape estimation takes approximately 15 minutes; garment parameter estimation takes 2 hours for 20 iterations of parameter optimization over all key frames; and garment deformation refinement takes 10-12 seconds per frame. After obtaining the deformed mesh, the albedo atlas fusion step takes 10 minutes for cloth albedo generation, and geometry refinement takes 30-40 seconds per frame. Note that the whole pipeline runs on the CPU, so significant speedups could be achieved by leveraging the parallel computation power of a GPU.

6.1 Qualitative Results

To evaluate our method, Fig. 1, Fig. 7 and the supplemental video provide reconstruction results on sequences captured with a monocular video camera, which show that our method generates plausible human performance capture results with detailed wrinkles and folds, benefiting from the proposed decomposition-based geometry and albedo refinement. Note that for the two sequences in the bottom row of Fig. 7, the human characters only perform motions facing the camera, so their backs are never captured by a turning motion. Nevertheless, our method still generates high quality results for these kinds of motions.

As the albedo map and the dynamic geometry details of the cloth mesh are maintained during motion, we can generate free-viewpoint renderings of the clothed human model. Fig. 8 shows 360-degree free-viewpoint renderings, where the cloth details remain distinct across viewpoints. Note that in the second and last examples of Fig. 8, the person only shows his front throughout the sequence, but with cloth simulation we can still render plausible results from the unseen viewpoints.

6.2 Comparisons

Fig. 9: Comparison with [1] and [109]. (a) From left to right: [1] result, input frame, our result. (b) From left to right: result by a typical non-rigid surface deformation approach using a commercial depth camera [109], input frame, our result.
Fig. 10: Qualitative and quantitative comparison with PIFu [83] using a rendered 4D model from the BUFF Dataset [111] as input. (a) From left to right: rendered human model, reconstruction results from different viewpoints by MulayCap and PIFu [83], error map. (b) Quantitative comparison between the two methods on one 4D sequence using the per-vertex average error.

We compared our human performance capture results with [1] and with typical template-based deformation methods [58, 38] using a commercial RGBD camera, as shown in Fig. 9 and the supplemental video. The video avatar reconstruction method in [1] takes a single-view video of a human performance as input, and rectifies all the poses in the image frames to a T-pose for bundle optimization of the shape. However, the subject needs to perform a restrictive movement to allow accurate shape reconstruction, so the method fails on more general shapes, poses and dynamic textures, as shown in Fig. 9(a). In contrast, our method works robustly even when subjects perform more casual motions with natural cloth-body interaction and dynamic texture details.

Fig. 9(b) shows the comparison with a typical template-based deformation approach [58, 38]. The result on the left is obtained by first fusing the geometry and texture using the DoubleFusion [109] system, followed by skeleton-driven non-rigid surface deformation to align with the depth data and the silhouette. As shown in Fig. 9(b) and the video, the texture of such a non-rigid reconstruction is static, so it cannot model dynamically changing surface details. In contrast, our method captures the dynamically changing wrinkles and produces more plausible garment deformations.

We also perform a quantitative evaluation on the BUFF Dataset [111] and compare MulayCap with PIFu [83], a deep learning method that reconstructs a clothed human body from a single image, also without a pre-scanned template. The reconstruction results and the per-vertex average error are shown in Fig. 10. As shown in Fig. 10(a), benefiting from our multi-layer representation and physics-based cloth simulation, we generate high-frequency cloth details on both the front and the back, and the estimated pose is consistent with the input image. Meanwhile, although the model generated by PIFu [83] looks plausible from the front view, it actually produces an incorrect pose, and the texture on the back is neither vivid nor realistic.

As for the quantitative experiments, we first bring the models from both PIFu [83] and MulayCap into a common coordinate frame with the ground truth 3D model of the BUFF Dataset [111], and then align them with the ground truth using ICP to solve for the scale and relative translation. The error is evaluated using the nearest-neighbor L2 distance. Fig. 10(b) shows that the per-vertex error of PIFu [83] is larger than that of MulayCap in most frames of an input video sequence rendered from the BUFF Dataset [111], which shows that our multi-layer human performance capture method generates more accurate results than the end-to-end network.
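The evaluation metric can be sketched as a nearest-neighbor query against the ground-truth vertices; the ICP alignment is assumed to have already been applied, and the data here are synthetic stand-ins.

```python
import numpy as np
from scipy.spatial import cKDTree

# Per-vertex error: L2 distance from each reconstructed vertex to its
# nearest ground-truth vertex, averaged over the reconstructed mesh.
def mean_nn_error(reconstructed, ground_truth):
    tree = cKDTree(ground_truth)
    dists, _ = tree.query(reconstructed)    # nearest-neighbor L2 distances
    return float(dists.mean())

rng = np.random.default_rng(1)
gt = rng.normal(size=(1000, 3))             # synthetic "ground truth" vertices
err = mean_nn_error(gt + 0.01, gt)          # small uniform offset -> small error
```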

6.3 Applications

With its multi-layer modeling of human performance, our method produces a fully semantic reconstruction that enables abundant editing possibilities in the following applications.

Garment Editing. Since the garment is semantically modeled in both shape and texture, garment editing in terms of either can be achieved, as demonstrated in Fig. 11. Garment shape editing (upper row) allows changing the length parameters of the T-shirt sleeves and the trousers, so that the human performance of the same character in new clothing can be obtained. By combining a new albedo color of the cloth with the original shading results, we can render realistic color editing results for the reconstructed human performance, as shown in the bottom row.

Fig. 11: Garment editing results. The upper row is garment shape editing and the bottom row is for garment color editing. From left to right: input RGB frames, reconstructed results, results with shape editing and color editing.

Retargeting. After the cloth shape and albedo have been generated for a sequence, we can retarget the clothing to other human bodies. Recall that the human model is represented by the SMPL model, which guarantees topology consistency between different human bodies. So we can calculate a non-rigid warp field between two human bodies with different shapes but the same pose, and apply this warp field to map cloth vertices between the two models. The result is shown in Fig. 12, where two target body shapes are used for the retargeting application.
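A minimal sketch of the warp-field idea, under the simplifying assumption that each cloth vertex borrows the displacement of its nearest body vertex (the bodies here are random stand-ins for topology-consistent SMPL meshes):

```python
import numpy as np
from scipy.spatial import cKDTree

# Because the two bodies share vertex topology, the warp field is just the
# per-vertex displacement between corresponding body vertices; each cloth
# vertex is moved by the warp of its nearest source-body vertex.
def retarget(cloth, body_src, body_dst):
    warp = body_dst - body_src               # topology-consistent warp field
    _, idx = cKDTree(body_src).query(cloth)  # nearest body vertex per cloth vertex
    return cloth + warp[idx]

rng = np.random.default_rng(2)
body_src = rng.normal(size=(500, 3))
body_dst = body_src * 1.2                    # a uniformly "larger" target body
cloth = body_src[:100] + 0.001               # cloth floats just off the body
new_cloth = retarget(cloth, body_src, body_dst)
```

A production version would blend the warps of several nearby body vertices to keep the mapped cloth smooth, but the per-vertex correspondence idea is the same.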

Fig. 12: Clothing retargeting between different human bodies. From left to right: reconstructed clothed human models, and two retargeting results using a taller thin body shape and a fatter body shape.

Relighting. Given albedo and detailed geometry with wrinkles and folds of the garment, we can generate relighting results for the captured sequence. As shown in Fig. 13, we put the character in four different environment illuminations and apply the relighting using spherical harmonic lighting coefficients generated by the cube-map texture. The albedo and geometry details are consistent in different lighting environments.

Augmented Reality. As we can automatically generate the 4D human performance from only an RGB video, the result can be integrated into a real video for VR/AR applications. Given a video sequence of a particular scene, as well as the camera position and orientation in each frame, we can render the human performance at a particular location in the scene. With AR glasses such as HoloLens, observers can watch the human performance from any viewpoint. Examples of such mixed-reality rendering are shown in Fig. 14 and the supplemental video.

Fig. 13: Relighting results in four different environmental illumination maps from  [27].
Fig. 14: Two frames of an augmented-reality application. We estimate the camera parameters using [32] and render the clothed human performance on the desk.
Fig. 15: Illustration of the failure case. (a) The input image. (b) The decomposed albedo image. (c) The decomposed shading image. (d) The rendered result.

7 Conclusion

In this paper, we present a novel method, called MulayCap, based on a multi-layer decomposition of geometry and texture for human performance capture using a single RGB video. Our method can generate novel free-view renderings of vivid cloth details and human motions from a casually captured video, either from the internet or captured by the user. There are three main advantages of MulayCap: (1) it obviates the need for tedious human specific template scanning before performance capture while still achieving high quality geometry reconstruction of the clothed human performance; this is made possible by the proposed GfV method, which uses cloth simulation techniques to estimate garment shape parameters by fitting the garment appearance to the input video sequence; (2) MulayCap achieves realistic rendering of the dynamically changing details on the garments by using a novel technique for decoupling texture into albedo and shading layers; it is worth noting that such dynamically changing textures have not been demonstrated in any existing monocular human performance capture system before; and (3) benefiting from the fully semantic modeling in MulayCap, the reconstructed 4D performance naturally enables various important editing applications, such as cloth editing, re-targeting, and relighting.

Limitation and Discussion: MulayCap mainly focuses on body and garment reconstruction, while other semantic elements such as the head, facial expressions, hands, skin and shoes would require extra effort to handle properly. Another deficiency is that the body motion still suffers from jittering, as the body shape parameters are difficult to estimate accurately and smoothly from the video with the available human shape and pose detection algorithms [50]. As a consequence, we cannot handle fast and extremely challenging motions, since pose detection on such motions contains too many errors for cloth simulation and garment optimization. Also, although our system is robust in common cases, our cloth patterns cannot cover all possible clothes, especially clothes with uncommon shapes.

In addition, in our pipeline, the quality of the albedo and shading images is crucial for the final rendering results, and may be affected by the performance of the intrinsic decomposition method to a certain extent. For garments with complex texture patterns such as the lattice T-shirt shown in Fig. 15, existing intrinsic decomposition methods can hardly produce accurate results. In our case, since the shading image extracted by [70] still contains much albedo information, the geometry detail solved by our system is contaminated by albedo information, as shown in Fig. 15. As most existing intrinsic decomposition methods are intended for general scenes, a novel intrinsic decomposition method designed specifically for garments may further improve the shading and albedo estimation in our task.

As for future work, more complete human performance capture including hands, skin, shoes, etc., as well as a wider variety of garment patterns such as skirts and coats, are promising directions to explore. With the booming research on single-image human body estimation [101, 49], research attention can be directed toward jitter-free motion reconstruction that handles more challenging motions. Overall, we believe our work may inspire follow-up research toward improving the quality of convenient and efficient human performance capture using a single monocular video camera, thus facilitating and promoting consumer-level human performance capture applications.


The authors would like to thank Tsinghua University and The Hong Kong University of Science and Technology for supporting this work.


  • [1] T. Alldieck, M. A. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll (2018) Video based reconstruction of 3d people models. In IEEE CVPR, Cited by: Fig. 9, §6.2.
  • [2] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll (2019-06) Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [3] T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor (2019) Tex2Shape: detailed full human body geometry from a single image. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [4] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. ACM Trans. Graph. 24 (3), pp. 408–416. Cited by: §2, §2.
  • [5] D. Baraff and A. Witkin (1998) Large steps in cloth simulation. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 43–54. Cited by: §2.
  • [6] J. T. Barron and J. Malik (2015) Shape, illumination, and reflectance from shading. IEEE transactions on pattern analysis and machine intelligence 37 (8), pp. 1670–1687. Cited by: §2.
  • [7] R. Basri and D. Jacobs (2001) Lambertian reflectance and linear subspaces. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, Vol. 2, pp. 383–390. Cited by: §5.3.
  • [8] S. Bell, K. Bala, and N. Snavely (2014) Intrinsic images in the wild. ACM Transactions on Graphics (TOG) 33 (4), pp. 159. Cited by: §2.
  • [9] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll (2019-10) Multi-garment net: learning to dress 3d people from images. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [10] F. Bogo, M. J. Black, M. Loper, and J. Romero (2015) Detailed full-body reconstructions of moving people from monocular rgb-d sequences. In ICCV, pp. 2300–2308. Cited by: §2.
  • [11] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In ECCV 2016, pp. 561–578. Cited by: §2.
  • [12] J. Bonet and R. D. Wood (1997) Nonlinear continuum mechanics for finite element analysis. Cambridge University Press, Cambridge. Cited by: §2.
  • [13] N. Bonneel, K. Sunkavalli, J. Tompkin, D. Sun, S. Paris, and H. Pfister (2014) Interactive intrinsic video editing. ACM Transactions on Graphics (TOG) 33 (6), pp. 197. Cited by: §2.
  • [14] S. Bouaziz, S. Martin, T. Liu, L. Kavan, and M. Pauly (2014-07) Projective dynamics: fusing constraint projections for fast simulation. ACM Trans. Graph. 33 (4), pp. 154:1–154:11. External Links: ISSN 0730-0301 Cited by: §2.
  • [15] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur (2008) Markerless garment capture. ACM Trans. Graphics (Proc. SIGGRAPH) 27 (3), pp. 99. Cited by: §2.
  • [16] G. J. Brostow, C. Hernández, G. Vogiatzis, B. Stenger, and R. Cipolla (2011-10) Video normals from colored lights. TPAMI 33 (10), pp. 2104–2114. Cited by: §2.
  • [17] T. Brox, B. Rosenhahn, J. Gall, and D. Cremers (2010) Combined region and motion-based 3d tracking of rigid and articulated objects. IEEE Trans. Pattern Anal. Mach. Intell. 32 (3), pp. 402–415. Cited by: §2.
  • [18] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen (2001) Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 425–432. Cited by: §5.1.
  • [19] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In IEEE CVPR, pp. 1302–1310. Cited by: §4.1.
  • [20] J. Carranza, C. Theobalt, M. A. Magnor, and H. Seidel (2003) Free-viewpoint video of human actors. ACM Trans. Graph. 22 (3), pp. 569–577. Cited by: §2.
  • [21] D. Casas, M. Volino, J. Collomosse, and A. Hilton (2014-04) 4D video textures for interactive character appearance. In Eurographics, Cited by: §2.
  • [22] X. Chen, B. Zhou, F. Lu, L. Wang, L. Bi, and P. Tan (2015-10) Garment modeling with a depth camera. ACM Trans. Graph. 34 (6), pp. 203:1–203:12. External Links: ISSN 0730-0301 Cited by: §2.
  • [23] K. Choi and H. Ko (2005) Stable but responsive cloth. In ACM SIGGRAPH 2005 Courses, pp. 1. Cited by: §2.
  • [24] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan (2015) High-quality streamable free-viewpoint video. ACM Trans. Graph 34 (4), pp. 69. Cited by: §2.
  • [25] R. Daněček, E. Dibra, A. C. Öztireli, R. Ziegler, and M. Gross (2017) DeepGarment: 3D Garment Shape Estimation from a Single Image. Computer Graphics Forum. External Links: ISSN 1467-8659, Document Cited by: §2.
  • [26] E. de Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H. Seidel, and S. Thrun (2008) Performance capture from sparse multi-view video. ACM Trans. Graph. 27 (3), pp. 98:1–98:10. Cited by: §2.
  • [27] P. Debevec (1998) Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In SIGGRAPH, pp. 189–198. Cited by: Fig. 13.
  • [28] M. Dou, P. Davidson, S. R. Fanello, S. Khamis, A. Kowdle, C. Rhemann, V. Tankovich, and S. Izadi (2017) Motion2fusion: real-time volumetric performance capture. ACM Transactions on Graphics (TOG) 36 (6), pp. 246. Cited by: §2.
  • [29] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, et al. (2016) Fusion4d: real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG) 35 (4), pp. 114. Cited by: §2.
  • [30] S. Duchêne, C. Riant, G. Chaurasia, J. Lopez-Moreno, P. Laffont, S. Popov, A. Bousseau, and G. Drettakis (2015) Multi-view intrinsic images of outdoors scenes with an application to relighting. ACM Transactions on Graphics, pp. 16. Cited by: §2.
  • [31] M. Eisemann, B. De Decker, M. Magnor, P. Bekaert, E. De Aguiar, N. Ahmed, C. Theobalt, and A. Sellent (2008) Floating textures. In Computer graphics forum, Vol. 27, pp. 409–418. Cited by: §5.1.
  • [32] S. Fuhrmann, F. Langguth, and M. Goesele (2014-09) MVE – a multi-view reconstruction environment. In European Conference on Computer Vision (ECCV), Cited by: Fig. 14.
  • [33] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H. Seidel (2009) Motion capture using joint skeleton tracking and surface estimation. In IEEE CVPR, pp. 1746–1753. Cited by: §2.
  • [34] M. Gallardo, T. Collins, A. Bartoli, and F. Mathias (2017) Dense non-rigid structure-from-motion and shading with unknown albedos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3884–3892. Cited by: §1.
  • [35] E. Garces, A. Muñoz, J. Lopez-Moreno, and D. Gutierrez (2012) Intrinsic images by clustering. Comput. Graph. Forum 31 (4), pp. 1415–1424. Cited by: §2.
  • [36] A. Gilbert, M. Volino, J. Collomosse, and A. Hilton (2018) Volumetric performance capture from minimal camera viewpoints. In ECCV, pp. 591–607. Cited by: §2.
  • [37] K. Gong, X. Liang, Y. Li, Y. Chen, M. Yang, and L. Lin (2018) Instance-level human parsing via part grouping network. In ECCV, pp. 805–822. Cited by: §3, §4.2.2.
  • [38] K. Guo, F. Xu, Y. Wang, Y. Liu, and Q. Dai (2015) Robust non-rigid motion tracking and surface reconstruction using l0 regularization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3083–3091. Cited by: §2, §6.2, §6.2.
  • [39] K. Guo, F. Xu, T. Yu, X. Liu, Q. Dai, and Y. Liu (2017) Real-time geometry, albedo, and motion reconstruction using a single rgb-d camera. ACM Transactions on Graphics (TOG) 36 (3), pp. 32. Cited by: §2.
  • [40] M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt (2019-07) LiveCap: real-time human performance capture from monocular video. ACM Transactions on Graphics, (Proc. SIGGRAPH). Cited by: §2.
  • [41] M. Habermann, W. Xu, M. Zollhöfer, G. Pons-Moll, and C. Theobalt (2018) ReTiCaM: real-time human performance capture from monocular video. CoRR abs/1810.02648. Cited by: §1, §2.
  • [42] N. Hasler, H. Ackermann, B. Rosenhahn, T. Thormählen, and H. Seidel (2010) Multilinear pose and body shape estimation of dressed subjects from image sets. In CVPR, pp. 1823–1830. Cited by: §2.
  • [43] T. Helten, M. Muller, H. Seidel, and C. Theobalt (2013) Real-time body tracking with one depth camera and inertial sensors. In ICCV, pp. 1105–1112. Cited by: §2.
  • [44] Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. LeGendre, L. Luo, C. Ma, and H. Li (2018) Deep volumetric video from very sparse multi-view performance capture. In ECCV, pp. 351–369. Cited by: §2.
  • [45] M. Innmann, M. Zollhöfer, M. Nießner, C. Theobalt, and M. Stamminger (2016) VolumeDeform: real-time volumetric non-rigid reconstruction. In ECCV, pp. 362–379. Cited by: §2.
  • [46] M. Janner, J. Wu, T. D. Kulkarni, I. Yildirim, and J. Tenenbaum (2017) Self-supervised intrinsic image decomposition. In Advances in Neural Information Processing Systems, pp. 5936–5946. Cited by: §2.
  • [47] C. Jiang, T. Gast, and J. Teran (2017-07) Anisotropic elastoplasticity for cloth, knit and hair frictional contact. ACM Trans. Graph. 36 (4), pp. 152:1–152:14. External Links: ISSN 0730-0301 Cited by: §2.
  • [48] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In IEEE CVPR, Cited by: §2.
  • [49] A. Kanazawa, J. Zhang, P. Felsen, and J. Malik (2018) Learning 3d human dynamics from video. CoRR abs/1812.01601. Cited by: §2, §7.
  • [50] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik (2019) Learning 3d human dynamics from video. In Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1, §7.
  • [51] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis (2019-10) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In International Conference on Computer Vision, Cited by: §2.
  • [52] P. Laffont and J. Bazin (2015) Intrinsic decomposition of image sequences from local temporal variations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 433–441. Cited by: §2.
  • [53] P. Laffont, A. Bousseau, and G. Drettakis (2013) Rich intrinsic image decomposition of outdoor scenes from multiple views. IEEE transactions on visualization and computer graphics 19 (2), pp. 210–224. Cited by: §2.
  • [54] Z. Lahner, D. Cremers, and T. Tung (2018-09) DeepWrinkles: accurate and realistic clothing modeling. In ECCV, Cited by: §2.
  • [55] V. Lempitsky and D. Ivanov (2007) Seamless mosaicing of image-based texture maps. In CVPR, pp. 1–6. Cited by: §5.1.
  • [56] C. Li, Z. Zhao, and X. Guo (2018) ArticulatedFusion: real-time reconstruction of motion, geometry and segmentation using a single depth camera. In ECCV, pp. 324–340. Cited by: §2.
  • [57] G. Li, C. Wu, C. Stoll, Y. Liu, K. Varanasi, Q. Dai, and C. Theobalt (2013) Capturing relightable human performances under general uncontrolled illumination. Comput. Graph. Forum 32 (2), pp. 275–284. Cited by: §1.
  • [58] H. Li, B. Adams, L. J. Guibas, and M. Pauly (2009) Robust single-view geometry and motion reconstruction. In ACM Transactions on Graphics (TOG), Vol. 28, pp. 175. Cited by: §2, §6.2, §6.2.
  • [59] Z. Li and N. Snavely (2018) Learning intrinsic image decomposition from watching the world. arXiv preprint arXiv:1804.00582. Cited by: §2.
  • [60] T. Liu, A. W. Bargteil, J. F. O’Brien, and L. Kavan (2013) Fast simulation of mass-spring systems. ACM Transactions on Graphics (TOG) 32 (6), pp. 214. Cited by: §2.
  • [61] Y. Liu, Q. Dai, and W. Xu (2010) A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Trans. Vis. Comput. Graph. 16 (3), pp. 407–418. Cited by: §2.
  • [62] Y. Liu, J. Gall, C. Stoll, Q. Dai, H. Seidel, and C. Theobalt (2013) Markerless motion capture of multiple characters using multiview image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 35 (11), pp. 2720–2735. Cited by: §2.
  • [63] M. Loper, N. Mahmood, and M. J. Black (2014) MoSh: motion and shape capture from sparse markers. ACM Trans. Graph. 33 (6), pp. 220:1–220:13. Cited by: §2.
  • [64] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015-10) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §2, §3, §4.1.
  • [65] Y. Matsushita, S. Lin, S. B. Kang, and H. Shum (2004) Estimating intrinsic images from image sequences with biased illumination. In European Conference on Computer Vision, pp. 274–286. Cited by: §2.
  • [66] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan (2000) Image-based visual hulls. In ACM Trans. Graph. (Proc. SIGGRAPH), pp. 369–374. Cited by: §2.
  • [67] A. Meka, M. Zollhöfer, C. Richardt, and C. Theobalt (2016) Live intrinsic video. ACM Transactions on Graphics (TOG) 35 (4), pp. 109. Cited by: §2.
  • [68] A. Mustafa, H. Kim, J. Guillemaut, and A. Hilton (2015) General dynamic scene reconstruction from multiple view video. In ICCV, pp. 900–908. Cited by: §2.
  • [69] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima (2019-06) SiCloPe: silhouette-based clothed people. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [70] T. Nestmeyer and P. V. Gehler (2017) Reflectance adaptive filtering improves intrinsic image estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 4. Cited by: §2, §3, §5, §7.
  • [71] R. A. Newcombe, D. Fox, and S. M. Seitz (2015) DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In IEEE CVPR, Cited by: §2.
  • [72] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele (2018-09) Neural body fitting: unifying deep learning and model based human pose and shape estimation. In International Conference on 3D Vision (3DV), Cited by: §2.
  • [73] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou, et al. (2016) Holoportation: virtual 3d teleportation in real-time. In UIST, pp. 741–754. Cited by: §2.
  • [74] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [75] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis (2018) Learning to estimate 3d human pose and shape from a single color image. arXiv preprint arXiv:1805.04092. Cited by: §2.
  • [76] R. Plänkers and P. Fua (2001) Tracking and modeling people in video sequences. Computer Vision and Image Understanding 81 (3), pp. 285–302. Cited by: §2.
  • [77] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black (2017-07) ClothCap: seamless 4d clothing capture and retargeting. ACM Trans. Graph. 36 (4), pp. 73:1–73:15. External Links: ISSN 0730-0301 Cited by: §1, §2.
  • [78] T. Popa, Q. Zhou, D. Bradley, V. Kraevoy, H. Fu, A. Sheffer, and W. Heidrich (2009) Wrinkling captured garments using space-time data-driven deformation. Computer Graphics Forum (Proc. Eurographics) 28 (2), pp. 427–435. Cited by: §2.
  • [79] W. H. Press (2007) Numerical recipes 3rd edition: the art of scientific computing. Cambridge university press. Cited by: §2.
  • [80] X. Provot (1995) Deformation constraints in a mass-spring model to describe rigid cloth behaviour. In Graphics Interface, pp. 147–154. Cited by: §3, §4.2.1.
  • [81] X. Provot (1995) Deformation constraints in a mass-spring model to describe rigid cloth behavior. In Graphics Interface, pp. 147–154. Cited by: §2.
  • [82] L. Rogge, F. Klose, M. Stengel, M. Eisemann, and M. Magnor (2014-12) Garment replacement in monocular video sequences. ACM Trans. Graph. 34 (1), pp. 6:1–6:10. External Links: ISSN 0730-0301 Cited by: §2.
  • [83] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172. Cited by: §2, Fig. 10, §6.2, §6.2.
  • [84] L. Sigal, A. Balan, and M. J. Black (2008) Combined discriminative and generative articulated pose and non-rigid shape estimation. In Advances in neural information processing systems, pp. 1337–1344. Cited by: §2.
  • [85] M. Slavcheva, M. Baust, and S. Ilic (2018) SobolevFusion: 3D Reconstruction of Scenes Undergoing Free Non-rigid Motion. In IEEE CVPR, Cited by: §2.
  • [86] M. Slavcheva, M. Baust, D. Cremers, and S. Ilic (2017) KillingFusion: non-rigid 3d reconstruction without correspondences. In IEEE CVPR, pp. 5474–5483. Cited by: §2.
  • [87] D. Song, R. Tong, J. Chang, X. Yang, M. Tang, and J. Zhang (2016) 3D body shapes estimation from dressed-human silhouettes. Comput. Graph. Forum 35 (7), pp. 147–156. Cited by: §2.
  • [88] J. Starck and A. Hilton (2007) Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27 (3), pp. 21–31. Cited by: §2.
  • [89] C. Stoll, J. Gall, E. de Aguiar, S. Thrun, and C. Theobalt (2010) Video-based reconstruction of animatable human characters. ACM Trans. Graph. 29 (6), pp. 139:1–139:10. Cited by: §1, §2.
  • [90] S. Tang, F. Tan, K. Cheng, Z. Li, S. Zhu, and P. Tan (2019-10) A neural network for detailed human depth estimation from a single image. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [91] A. Telea (2004) An image inpainting technique based on the fast marching method. Journal of Graphics Tools 9 (1), pp. 23–34. Cited by: §5.1.
  • [92] D. Terzopoulos, J. Platt, A. Barr, and K. Fleischer (1987) Elastically deformable models. In ACM Siggraph Computer Graphics, Vol. 21, pp. 205–214. Cited by: §2.
  • [93] D. Vlasic, I. Baran, W. Matusik, and J. Popovic (2008) Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph. 27 (3), pp. 97:1–97:9. Cited by: §2.
  • [94] D. Vlasic, P. Peers, I. Baran, P. E. Debevec, J. Popovic, S. Rusinkiewicz, and W. Matusik (2009) Dynamic shape capture using multi-view photometric stereo. ACM Trans. Graph. 28 (5), pp. 174:1–174:11. Cited by: §2.
  • [95] A. Walsman, W. Wan, T. Schmidt, and D. Fox (2017) Dynamic high resolution deformable articulated tracking. In 3D Vision (3DV), 2017 International Conference on, pp. 38–47. Cited by: §2.
  • [96] T. Y. Wang, D. Ceylan, J. Popovic, and N. J. Mitra (2018) Learning a shared shape space for multimodal garment design. ACM Trans. Graph. 37 (6), pp. 1:1–1:14. External Links: Document Cited by: §2.
  • [97] M. Waschbüsch, S. Würmlin, D. Cotting, F. Sadlo, and M. H. Gross (2005) Scalable 3d video of dynamic scenes. The Visual Computer 21 (8-10), pp. 629–638. Cited by: §2.
  • [98] R. White, K. Crane, and D. A. Forsyth (2007) Capturing and animating occluded cloth. In ACM SIGGRAPH 2007 Papers, SIGGRAPH ’07. Cited by: §2.
  • [99] C. Wu, C. Stoll, L. Valgaerts, and C. Theobalt (2013) On-set performance capture of multiple actors with a stereo camera. ACM Trans. Graph. 32 (6), pp. 161:1–161:11. Cited by: §2.
  • [100] C. Wu, B. Wilburn, Y. Matsushita, and C. Theobalt (2011) High-quality shape from multi-view stereo and shading under general illumination. In CVPR, pp. 969–976. Cited by: §5.2.
  • [101] D. Xiang, H. Joo, and Y. Sheikh (2018) Monocular total capture: posing face, body, and hands in the wild. CoRR abs/1812.01598. Cited by: §7.
  • [102] L. Xu, Z. Su, T. Yu, Y. Liu, and L. Fang (2019) UnstructuredFusion: realtime 4d geometry and texture reconstruction using commercial rgbd cameras. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §2.
  • [103] W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H. Seidel, and C. Theobalt (2018) MonoPerfCap: human performance capture from monocular video. ACM Trans. Graph. 37 (2), pp. 27:1–27:15. Cited by: §1, §2.
  • [104] S. Yang, Z. Pan, T. Amert, K. Wang, L. Yu, T. Berg, and M. C. Lin (2018-11) Physics-inspired garment recovery from a single-view image. ACM Trans. Graph. 37 (5), pp. 170:1–170:14. External Links: ISSN 0730-0301 Cited by: §2.
  • [105] G. Ye, E. Garces, Y. Liu, Q. Dai, and D. Gutierrez (2014-07) Intrinsic video and applications. ACM Trans. Graph. 33 (4), pp. 80:1–80:11. External Links: ISSN 0730-0301 Cited by: §2, §2.
  • [106] G. Ye, Y. Liu, N. Hasler, X. Ji, Q. Dai, and C. Theobalt (2012) Performance capture of interacting characters with handheld kinects. In Computer Vision–ECCV 2012, pp. 828–841. Cited by: §2.
  • [107] M. Ye and R. Yang (2014) Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In CVPR, pp. 2345–2352. Cited by: §2.
  • [108] T. Yu, K. Guo, F. Xu, Y. Dong, Z. Su, J. Zhao, J. Li, Q. Dai, and Y. Liu (2017-10) BodyFusion: real-time capture of human motion and surface geometry using a single depth camera. In IEEE ICCV, Cited by: §2.
  • [109] T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu (2018-06) DoubleFusion: real-time capture of human performances with inner body shapes from a single depth sensor. In IEEE CVPR, Cited by: §2, Fig. 9, §6.2.
  • [110] T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll, and Y. Liu (2019-06) SimulCap: single-view human performance capture with cloth simulation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [111] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll (2017-07) Detailed, accurate, human shape estimation from clothed 3d scan sequences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, Fig. 10, §6.2, §6.2.
  • [112] Q. Zhao, P. Tan, Q. Dai, L. Shen, E. Wu, and S. Lin (2012) A closed-form solution to retinex with nonlocal texture constraints. IEEE transactions on pattern analysis and machine intelligence 34 (7), pp. 1437–1444. Cited by: §2.
  • [113] Z. Zheng, T. Yu, H. Li, K. Guo, Q. Dai, L. Fang, and Y. Liu (2018-09) HybridFusion: real-time performance capture using a single depth sensor and sparse imus. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [114] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu (2019-10) DeepHuman: 3d human reconstruction from a single image. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [115] B. Zhou, X. Chen, Q. Fu, K. Guo, and P. Tan (2013) Garment Modeling from a Single Image. Computer Graphics Forum. External Links: ISSN 1467-8659 Cited by: §2.
  • [116] H. Zhu, X. Zuo, S. Wang, X. Cao, and R. Yang (2019-06) Detailed human shape estimation from a single image by hierarchical mesh deformation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [117] M. Zollhöfer, M. Nießner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, et al. (2014) Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (TOG) 33 (4), pp. 156. Cited by: §2.