Reconstructing non-rigid (i.e., moving and deforming) objects, such as people and animals, has a wide variety of applications, such as novel view synthesis [1, 2]. Unlike rigid object/scene reconstruction (e.g., [3, 4, 5, 6, 7, 8]), which can cast 3D reconstruction as an alignment problem, captured frames of non-rigid objects must be handled as a sequence rather than as different views of the same scene, since the shape may change from one frame to the next. This makes the problem challenging.
Some approaches have addressed this problem without using any prior knowledge of the object [9, 10, 11, 12] or using only the assumption of articulated objects. These approaches have an inherent weakness in synthesizing unobserved shapes and textures, which is critical for some applications that require view synthesis from arbitrary viewpoints, even from unobserved directions.
If we know in advance the object we are going to capture, we can make use of prior knowledge to address this issue. A 3D geometry template is one possible source of prior knowledge for full-body reconstruction. Template-based methods acquire a shape template of the target before capturing it in motion and subsequently fit the template to measurements obtained from cameras or RGB-D sensors [13, 11]. This approach relies largely on non-rigid 3D registration and may suffer from insufficient constraints on the possible motions of the target objects, so the fitting may be trapped in a local minimum far from the global one.
For capturing humans in particular, we can use human shape models as prior knowledge. Notably, statistical human shape models (e.g., [15, 16, 17, 18]) can serve as a strong regularizer on possible variations and deformations of human bodies, such as poses and body shapes (e.g., tall, short, slim, and sturdy). These statistical models are trained with a number of full-body measurements. By reducing the number of parameters, it becomes more likely to find a local minimum sufficiently close to the optimum, even with a partial measurement.
Generally, human bodies exhibit non-rigid deformations according to their poses (e.g., bending an arm deforms muscles, skin, and clothes), which we call pose-dependent deformations, and statistical models can describe such deformations only partially. Existing datasets can be used for training a statistical model but contain measurements of people only in skin-tight clothes; therefore, the pose-dependent deformations of muscles and skin are encoded in the model, but those of clothes are not.
As mentioned above, the key role of such statistical human shape models is to interpolate unobserved surfaces in measurements of human bodies. However, people rarely wear skin-tight clothes in real situations, and the gap between real measurements and those in the dataset may hinder plausible interpolation (for example, clothing folds in unobserved volumes may be smoothed out during the fitting process). At the least, a statistical model cannot fill in unobserved surfaces with clothing folds, which may cause significant visual artifacts in rendered models.
This paper proposes a method for full-body reconstruction of moving non-rigid 3D objects, primarily humans, from RGB-D measurements. Our method also uses a statistical model for rough reconstruction. In this sense, ours is similar to the method by Bogo et al., who developed a multi-resolution statistical model and a sophisticated fitting technique that simultaneously optimizes shape, a single set of textures, and a displacement map. They tested their method with people in skin-tight clothes.
In contrast, our method is designed to handle people in loose clothes, for which a statistical model does not work very well. The main idea of our method is to estimate the pose-dependent deformations of unobserved surfaces, representing them with a relatively small number of parameters via PCA. Instead of finding an accurate 3D mesh directly, we use a rough 3D mesh as the base shape and apply precise deformations to it. By doing so, our method can handle pose-dependent deformations, such as clothing folds.
To achieve this, we propose to use the eigen-texture method [20, 21], which embeds view- and light-dependent texture representations of each triangle in a 3D mesh into low-dimensional spaces, provided that the textures and deformations have a certain regularity. Under the assumptions that we can measure a human body in various poses from various directions while capturing the person in motion, and that deformation depends solely on pose, we can synthesize the pose-dependent components of a human body. The main contributions of this paper are summarized as follows:
We introduce eigen-texturing to textured full-body reconstruction in order to compress texture representation as well as to synthesize textures on unobserved surfaces.
We propose eigen-deformation, which embeds the displacement between a statistic model and a fully-fitted 3D mesh into a low dimensional space, enabling the displacement estimation of unobserved surfaces with a relatively small number of parameters.
In order to estimate the parameters of eigen-texturing and eigen-deformation for unobserved body parts, we develop a neural network (NN)-based coefficient regression to synthesize textures and deformations for arbitrary poses as well as viewing directions.
The difficulty in statistical shape model-based full-body reconstruction of a moving human body in loose clothing lies mostly in the reproduction of pose-dependent deformations that are not described by a statistical shape model. Our idea is to represent such deformations by texture and by individual displacements of mesh vertices, both of which are embedded into low-dimensional spaces (eigen-texture and eigen-deformation). The individual displacements represent the difference between the statistical shape model's mesh and one fully registered to the measurement, and account for relatively large deformations, while the texture reproduces the details. In the same way as the eigen-texture method [20, 21], our system can compress the storage used for individual textures and displacements by keeping only a small number of eigenvectors and their coefficients. In addition, it can interpolate unobserved surfaces by using the bases computed for the full body.
Fig. 1 shows an overview of our full-body reconstruction system. At the preprocessing stage, the system registers a statistical shape model to sequences of point clouds obtained from RGB-D measurements. We use the SMPL model for non-rigid registration, obtaining the parameters of the statistical model (body shape parameters and a set of joint angles) as well as a mesh fully registered to each point cloud (bottom left of Fig. 1). As displacements, we compute the difference between the statistical model mesh and the fully-registered mesh, which can be done solely from the registration result. The displacements are then embedded into a low-dimensional subspace (eigen-deformation) in a similar way to the eigen-texture method (top left of Fig. 1). At the rendering stage, the statistical model mesh is first recovered from the shape and pose parameters, and then the textures and displacements are reconstructed from coefficients estimated by NN-based regression. Adding all the displacements to the recovered mesh yields the clothed mesh.
III. Non-rigid registration with SMPL
We used the SMPL model for non-rigid registration to the simulation data and scanned data. First, we manually specify a few correspondence points, used as anchor points, between the SMPL model and the simulation or scanned data. According to the anchor points, we fit the SMPL model, which is used as the naked model in the regression stage as in Fig. 7 (the second model from the right), by estimating its pose. Then, we obtain additional correspondence points on boundaries, where we calculate closest points. Using those correspondences, the SMPL model is inflated to fit the target model. If this step is skipped, the SMPL model frequently collapses due to large shape differences at the initial registration (e.g., the left side of an arm corresponds to the right side). Finally, we super-sample the vertices and establish correspondences between the two meshes with the nearest neighbor method as in Fig. 7 (the right-most model). After these steps, we conduct non-rigid registration by minimizing the following energy function:
The energy function comprises data terms over the set of anchor points and the set of correspondence points established in the final step, summed over all joints and vertices of the SMPL model, together with a penalty for implausible joint angles of the elbow and knee, a penalty for large changes in joint angles, and a penalty for large displacement values, balanced by weight values.
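A minimal sketch of such an energy is given below. The penalty forms, weights, and function names are illustrative assumptions; in particular, the penalty on implausible elbow/knee angles is omitted here.

```python
import numpy as np

def registration_energy(model_v, target_v, anchors, corrs,
                        theta, theta_prev,
                        w_data=1.0, w_smooth=0.1, w_mag=0.01):
    """Sketch of the registration energy. `anchors` and `corrs` are lists of
    (model_index, target_index) pairs; `theta` is the joint-angle vector.
    The penalty forms and weights are illustrative; the actual energy also
    penalizes implausible elbow/knee angles, omitted here."""
    e = 0.0
    # Data terms over anchor points and closest-point correspondences.
    for pairs in (anchors, corrs):
        for i, j in pairs:
            e += w_data * float(np.sum((model_v[i] - target_v[j]) ** 2))
    # Penalize large changes in joint angles.
    e += w_smooth * float(np.sum((theta - theta_prev) ** 2))
    # Penalize large joint-angle values to keep the pose plausible.
    e += w_mag * float(np.sum(theta ** 2))
    return e
```

The energy is zero when the model already coincides with the target at all correspondences and the pose is unchanged, and it grows with misalignment.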
One of the widely accepted ways to reproduce deformations, especially small ones, is texture mapping. Under the assumption of skin-tight clothes, as in Bogo et al., the texture can be a static image; we relax this assumption by using dynamically changing textures. As a texture representation, we use an eigen-decomposition method to reduce the storage size required for texture images. In addition, our body scans obtained from two RGB-D sensors usually have some unobserved areas. The system reconstructs a plausible full-body mesh thanks to the statistical shape model; however, the textures of triangles in such areas are not recoverable due to the unavailability of prior knowledge on the texture. Eigen-texture finds the manifold on which textures of the same triangle lie and thus can synthesize textures for unobserved regions.
To extract the texture of a triangle, we first assess the visibility of each triangle in the registered mesh. The system renders the mesh with the standard OpenGL pipeline to obtain the depth map in the camera coordinate system. We also render a triangle ID map by color-coding the ID of each triangle. Each pixel with a valid triangle ID is back-projected onto the corresponding triangle, and the resulting point is projected into the depth map. The pixel is visible if the difference between the corresponding depth value in the depth map and the depth of the point on the triangle is smaller than a certain threshold. We judge a triangle to be visible only if all pixels in the triangle are visible, since a partly occluded triangle may significantly spoil the reconstruction. For each visible triangle, the 3D positions of its vertices are projected onto the RGB image to extract the triangle's texture. Hereinafter, the triangle index is omitted where unambiguous.
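The per-triangle visibility test can be sketched as follows, assuming the triangle ID map and the two depth arrays have already been rendered; the array layout, threshold, and background ID are illustrative assumptions.

```python
import numpy as np

def visible_triangles(tri_id_map, point_depth_map, rendered_depth_map,
                      eps=5e-3, background=-1):
    """A triangle is judged visible only if every pixel it covers passes the
    depth test, i.e., |back-projected point depth - rendered depth| < eps.
    Array layout, threshold, and the background ID are illustrative."""
    passed = np.abs(point_depth_map - rendered_depth_map) < eps
    visible = {}
    for tid in np.unique(tri_id_map):
        if tid == background:
            continue  # pixel not covered by any triangle
        mask = tri_id_map == tid
        # A partly occluded triangle would spoil the texture, so require all.
        visible[int(tid)] = bool(np.all(passed[mask]))
    return visible
```

Requiring all covered pixels to pass, rather than a majority, matches the conservative rule above: a partly occluded triangle is excluded entirely.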
Then, we apply eigen-decomposition [20, 21] to each triangle, which is briefly introduced here to make the paper self-contained. Let $A \in \mathbb{R}^{P \times F}$ be a matrix whose $f$-th column $\mathbf{a}_f$ is the vectorized texture of a certain visible triangle, where $P$ is the number of pixels in the triangle and $F$ is the number of frames in which the triangle is visible. $P$ can be arbitrarily set to a sufficiently large number because we can arbitrarily warp the texture. All column vectors in $A$ are centralized, i.e., $\bar{\mathbf{a}}_f = \mathbf{a}_f - \boldsymbol{\mu}$, to form $\bar{A}$, where $\boldsymbol{\mu}$ is the averaged texture over the frames in which the triangle is visible. We can factorize $\bar{A}$ as

$$\bar{A}\bar{A}^\top = \sum_i \lambda_i \mathbf{e}_i \mathbf{e}_i^\top,$$

where $\lambda_i$ is the $i$-th eigenvalue and $\mathbf{e}_i$ the $i$-th eigenvector.

We can embed a texture into the subspace spanned by a subset of eigenvectors. The low-dimensional representation (i.e., the coefficient for each eigenvector) of a texture $\mathbf{a}$ can be computed by

$$\mathbf{c} = E_K^\top (\mathbf{a} - \boldsymbol{\mu}),$$

where $E_K$ is a matrix whose columns are the eigenvectors corresponding to the $K$ largest eigenvalues ($K$ is the dimensionality of the subspace). We can also reconstruct the texture from $\mathbf{c}$ with

$$\hat{\mathbf{a}} = E_K \mathbf{c} + \boldsymbol{\mu}.$$
This means that unobserved textures can be synthesized if we can regress the low-dimensional coefficient vector.
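The per-triangle embedding and reconstruction described above amount to standard PCA. A compact sketch using SVD (symbol and function names are ours):

```python
import numpy as np

def fit_eigen_texture(A, k):
    """A: (P, F) matrix whose columns are the vectorized textures of one
    triangle over the F frames in which it is visible. Returns the mean
    texture and the top-k eigenvectors (via SVD of the centered matrix)."""
    mu = A.mean(axis=1, keepdims=True)      # averaged texture over frames
    U, _, _ = np.linalg.svd(A - mu, full_matrices=False)
    return mu.ravel(), U[:, :k]

def embed(a, mu, E):
    return E.T @ (a - mu)                   # coefficients in the subspace

def reconstruct(c, mu, E):
    return E @ c + mu                       # approximate texture
```

If the textures of a triangle truly lie in a low-dimensional subspace, a texture is recovered exactly from its few coefficients; only those coefficients and the shared bases need to be stored.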
Our eigen-texture method is a powerful tool for visually representing deformations caused by loose clothes. However, such texture-based compensation may not be enough for large deformations. To synthesize deformations of unobserved surfaces, as well as to compress the storage for individual vertex positions, we propose eigen-deformation, inspired by the eigen-texture method.
The basic idea is almost the same as eigen-texture, but eigen-deformation deals with vertex positions. There is an inherent difference between textures and deformations: given that the mesh is well registered to the point cloud, the variations in textures are not significant, since they are caused solely by local deformations such as wrinkles. The variations in vertex positions, on the other hand, come from body poses; a change in, e.g., the shoulder joint angle leads to large changes in the vertex positions of the forearm. Thus, direct application of eigen-decomposition to vertex positions may not work well.
To improve the representational power, we compute displacement vectors of each part between the statistical model mesh and the fully-registered mesh. We represent each displacement vector in a coordinate system associated with each body part; therefore, only the non-rigid difference between the two meshes is counted in the displacement vector. Body parts are divided at each joint, and their indices are shown in Fig. 3 (right).
The displacement vector of the $k$-th vertex of the $p$-th body part is computed as $\mathbf{d}_{p,k} = T_p \mathbf{v}'_{p,k} - T_p \mathbf{v}_{p,k}$, where $\mathbf{v}_{p,k}$ and $\mathbf{v}'_{p,k}$ are the $k$-th vertices of the $p$-th body part in the statistical model mesh and the fully-registered mesh, and $T_p$ is the rigid transformation between the entire body coordinate system and the $p$-th body part's coordinate system. We concatenate these displacement vectors to form a column vector and then aggregate the vectors over all frames in which the triangle is visible. These vectors are centralized as in eigen-texture and again concatenated to form a matrix, to which we apply eigen-decomposition to obtain eigenvectors. We can then embed/reconstruct the displacement vectors into/from the subspace spanned by the eigenvectors. Examples of cumulative contribution ratios for three body parts from measurements of a real human body are shown in Fig. 4. The ratio increases drastically with a small number of eigenvectors.
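The per-part displacement computation can be sketched as follows, assuming the part transform is available as a rotation; since the same rigid transform is applied to both meshes, its translation cancels in the difference.

```python
import numpy as np

def part_displacements(v_model, v_registered, R_part):
    """Displacements of one body part's vertices, expressed in the part's
    local coordinate system. v_model / v_registered: (N, 3) vertex arrays of
    the statistical model mesh and the fully-registered mesh; R_part: the
    rotation of the part's rigid transform (its translation cancels in the
    difference T(v') - T(v) = R(v' - v))."""
    return (R_part @ (v_registered - v_model).T).T
```

Expressing the residual in the part's local frame removes the pose-induced rigid motion, leaving only the non-rigid component for eigen-decomposition.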
When reconstructing the clothed mesh, we first recover the statistical model mesh from the body shape parameters and joint angles, and then recover the displacement vectors using a small number of eigenvectors. Finally, each displacement is transformed back and added to the corresponding vertex of each body part. Examples of shapes of a real human body recovered with a small number of eigenvectors are shown in Fig. 3, where the error in the recovered shape is visualized in pseudo-color, demonstrating that a larger number of eigenvectors decreases the error.
VI. NN-based coefficient regression
For full-body reconstruction, we interpolate the textures and deformations of unobserved surfaces. We do this via coefficient regression in the eigen-texture and eigen-deformation spaces. Provided that the illumination and RGB-D sensors are fixed, as in our case, the variations of the textures and displacements of the same triangle are explained solely by the person's pose. More specifically, the joint angles, represented by rotation matrices, mostly determine them. This implies that the coefficients for the eigenvectors (i.e., coordinates in the low-dimensional spaces) can be regressed from the rotation matrices. We therefore train NN-based regressors that map a joint angle (a rotation matrix) to the coefficients.
Let $\mathbf{r} \in \mathbb{R}^9$ be the vectorization of the rotation matrix that represents a certain body part's joint angle. Since the relationship between rotation matrices and coefficients is unknown, we use an NN with two layers to represent the nonlinearity. Our regressor gives the coefficients by

$$\hat{\mathbf{c}} = W_2\, \sigma(W_1 \mathbf{r}),$$

where $W_1 \in \mathbb{R}^{H \times 9}$ and $W_2 \in \mathbb{R}^{K \times H}$ are weight matrices and $\sigma(\cdot)$ is a nonlinear activation. The regressor is trained with the gradient descent algorithm, with weight decay for regularization. Examples of coefficients estimated from poses are shown in Fig. 6, and reconstructed shapes are shown in Fig. 6, demonstrating the validity of the algorithm.
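A minimal numpy sketch of such a regressor is given below; the hidden size and tanh nonlinearity are our assumptions, as the text specifies only a two-layer NN.

```python
import numpy as np

def init_regressor(hidden=32, k=10, seed=0):
    """One regressor per triangle/body part: maps a vectorized 3x3 rotation
    (9-dim) to k eigen coefficients through one hidden layer. The hidden
    size and tanh nonlinearity are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0.0, 0.1, (hidden, 9)),
            "W2": rng.normal(0.0, 0.1, (k, hidden))}

def regress(params, R):
    r = R.reshape(9)                                 # vectorize the rotation
    return params["W2"] @ np.tanh(params["W1"] @ r)  # eigen coefficients
```

Training would minimize the squared error between regressed and ground-truth coefficients over the observed frames, with a weight-decay term added to the loss.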
VII-A. Evaluation with synthetic data
We applied our method to synthetic data for evaluation purposes. We use a commercially available 3D mesh model of an entire human body with and without clothes. A skeleton (bones) was also attached to the mesh model so that we were able to create sequences of 3D meshes and rendered images using 3D-CG software, e.g., 3ds Max. Since muscle deformation and cloth simulation were employed in rendering, realistic shape deformations with complicated shading effects were represented. Some examples of rendered images are shown in Fig. 7. In the following, the sequences of rendered images, 3D meshes with/without clothes, and poses are used as inputs.
First, we applied eigen-texture and eigen-deformation. Cumulative contribution ratios are shown in Fig. 10, and differences from the ground-truth (i.e., clothed) meshes are shown in Fig. 8. As shown in the plots and figures, 10 eigenvectors are sufficient to represent the original mesh. This is equivalent to 2.52% of the original data.
Next, we evaluated our regressors. Results are shown in Fig. 9. As shown in the figures, our NNs worked well with the synthetic data, and most of the body parts have small errors. There are large errors around the arm joints and crotch; those errors mainly come from failures in our fitting algorithm. Although those areas are usually invisible in rendered images and thus less critical for practical use, we will seek a solution in future work.
We also evaluated interpolation accuracy. We adopted two scenarios, short-term and long-term interpolation, as explained in Fig. 13. Results for interpolation and extrapolation are shown in Fig. 13 and Fig. 13, and coefficient values for extrapolation are shown in Fig. 14. As shown in the figures, extrapolation tends to produce larger errors than interpolation; however, the regressed coefficients still follow a similar trend to the ground truth.
The textured results with interpolated shapes are shown in Fig. 15. Considering the compression rate of 2.52%, the visual quality of the rendered images is comparable to video compression, with the significant added advantage of arbitrary-viewpoint rendering.
VII-B. Demonstration with real data
For the real-data experiment, we used two calibrated RGB-D sensors to capture two sequences of a moving person from the front and back. The pair of depth measurements from each corresponding pair of frames was integrated according to the RGB-D sensors' relative poses to form a single point cloud. This point cloud, as well as the corresponding RGB images, still had unobserved surfaces due to, e.g., self-occlusion, although they were not large. Our system was able to synthesize the deformations of such surfaces. Note that our eigen-texture and eigen-deformation-based approach can potentially be applied even to a single RGB-D sequence.
In this paper, we presented the eigen-texturing and eigen-deformation methods, enabling full-body reconstruction with loose clothes. By using low-dimensional embeddings of texture and deformation, i.e., 10 coefficients for our datasets, the storage size required for our model representation is drastically reduced. The method is also capable of long-term interpolation. We evaluated our method using both synthetic and real data, demonstrating its effectiveness both visually and quantitatively. In the future, more complicated shapes such as skirts should be taken into account.
This work was supported by JSPS/KAKENHI 16H02849, 16KK0151, MIC/SCOPE 171507010 and MSR CORE12.
-  J.-Y. Guillemaut, J. Kilner, and A. Hilton, "Robust graph-cut scene segmentation and reconstruction for free-viewpoint video of complex dynamic scenes," in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2009, pp. 809–816.
-  A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan, “High-quality streamable free-viewpoint video,” ACM Trans. Graphics (TOG, Proc. ACM SIGGRAPH), vol. 34, no. 4, pp. 69:1–69:13, 2015.
-  M. Jancosek and T. Pajdla, "Multi-view reconstruction preserving weakly-supported surfaces," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3121–3128.
-  Y. Furukawa and J. Ponce, “Accurate, dense, and robust multi-view stereopsis,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 8, pp. 1362–1376, 2010.
-  P. J. Besl and N. D. McKay, “A method for registration of 3-D shapes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, 1992.
-  S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, and A. Fitzgibbon, “Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera,” in Proc. ACM Symposium on User Interface Software and Technology (UIST), 2011, pp. 559–568.
-  R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in IEEE Int. Symp. Mixed and Augmented Reality, 2011, pp. 127–136.
-  T. Whelan, J. McDonald, H. J. M. Kaess, M. Fallon, and J. Leonard, “Kintinuous: Spatially extended KinectFusion,” in Proc. RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, 2012.
-  B. Amberg, S. Romdhani, and T. Vetter, “Optimal step nonrigid ICP algorithms for surface registration,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
-  H. Li, R. W. Sumner, and M. Pauly, "Global correspondence optimization for non-rigid registration of depth scans," Computer Graphics Forum (Proc. Symposium on Geometry Processing), vol. 27, no. 5, 2008, pp. 1421–1430.
-  M. Zollhöfer, M. Nießner, S. Izadi, C. Rehmann, C. Zach, M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, and M. Stamminger, “Real-time non-rigid reconstruction using an rgb-d camera,” ACM Trans. Graphics (TOG, Prof. ACM SIGGRAPH), vol. 33, no. 4, pp. 156:1–156:12, 2014.
-  R. A. Newcombe, D. Fox, and S. M. Seitz, “Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 343–352.
-  H. Li, B. Adams, L. J. Guibas, and M. Pauly, “Robust single-view geometry and motion reconstruction,” ACM Trans. Graphics (TOG, Proc. ACM SIGGRAPH), vol. 28, no. 5, pp. 175:1–175:10, 2009.
-  C. Malleson, M. Klaudiny, A. Hilton, and J. Y. Guillemaut, “Single-view RGBD-based reconstruction of dynamic human geometry,” in IEEE Int. Conf, Computer Vision Workshops, 2013, pp. 307–314.
-  D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, J. Davis, and S. Cruz, “Scape: Shape completion and animation of people,” ACM Trans. Graphics (TOG, Proc. ACM SIGGRAPH), vol. 24, no. 3, pp. 408–416, 2005.
-  Y. Chen, Z. Liu, and Z. Zhang, "Tensor-based human body modeling," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 105–112.
-  F. Bogo, M. J. Black, M. Loper, and J. Romero, “Detailed full-body reconstructions of moving people from monocular rgb-d sequences,” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 2300–2308.
-  M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, “SMPL: A skinned multi-person linear model,” ACM Trans. Graphics (Proc. SIGGRAPH Asia), vol. 34, no. 6, pp. 248:1–248:16, Oct. 2015.
-  L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele, “Building statistical shape spaces for 3d human modeling,” arXiv preprint, arXiv:1503.05860, pp. 1–10, 2015.
-  K. Nishino, Y. Sato, and K. Ikeuchi, “Eigen-texture method: Appearance compression and synthesis based on a 3d model,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1257–1265, 2002.
-  Y. Nakashima, F. Okura, N. Kawai, H. Kawasaki, A. Blanco, and K. Ikeuchi, "Realtime novel view synthesis with eigen-texture regression," in Proc. British Machine Vision Conference (BMVC), 2017.