This paper presents a novel unsupervised approach to reconstruct human shape and pose from noisy point clouds. Traditional approaches search for correspondences and conduct model fitting iteratively, where a good initialization is critical. Relying on large datasets with ground-truth annotations, recent learning-based approaches predict correspondences for every vertex of the point cloud; the Chamfer distance is usually used to minimize the distance between a deformed template model and the input point cloud. However, the Chamfer distance is quite sensitive to noise and outliers, and can therefore be unreliable for assigning correspondences. To address these issues, we model the probability distribution of the input point cloud as generated from a parametric human model under a Gaussian Mixture Model. Instead of explicitly aligning correspondences, we treat the process of correspondence search as an implicit probabilistic association by updating the posterior probability of the template model given the input. A novel unsupervised loss is further derived that penalizes the discrepancy between the deformed template and the input point cloud conditioned on the posterior probability. Our approach is flexible: it works with both complete and incomplete point clouds, including even a single depth image as input. Our network is trained from scratch with no need to warm up the network with supervised data. Compared to previous unsupervised methods, our method shows the capability to deal with substantial noise and outliers. Extensive experiments conducted on various public synthetic datasets as well as a very noisy real dataset (i.e. CMU Panoptic) demonstrate the superior performance of our approach over the state-of-the-art methods. Code can be found at <https://github.com/wangsen1312/unsupervised3dhuman.git>.
With the rapid development of sensing technology, it has become increasingly popular to digitize humans as 3D scans dou2016fusion4d; yu2018doublefusion. This gives rise to a surging demand for techniques to edit, control and animate the acquired 3D content, which often involves turning a 3D human scan into a SMPL model loper2015smpl or a similar parametric rigged representation. Meanwhile, it is usually difficult to annotate point clouds captured in real life. This motivates us to investigate an unsupervised approach to fit a parametric rigged representation to an input human scan. Without loss of generality, we focus on fitting a SMPL model to a 3D human point cloud.
Traditional approaches address this parametric model fitting problem using iterative closest point (ICP) strategies besl1992icp; pons2017clothcap. The main idea is to iteratively search for correspondences between the parametric model and the input scan, then re-estimate the parameters governing the pose, shape and non-rigid surface displacement of the template model. However, they rely heavily on a good initialization to avoid getting stuck in a local minimum. Existing learning-based approaches wei2016dense; litany2017deepfunctional; ginzburg2020cyclic, on the other hand, typically rely on a well-annotated training set, either synthetic or real-world. Moreover, the input point clouds are assumed to have relatively clean surfaces. These methods therefore struggle in practical scenarios, where 3D shape annotations are often difficult to obtain and outliers are prevalent in the input scans. Last but not least, existing works usually require a complete point cloud from full-view 3D scans as input. This severely limits their use when only an incomplete point cloud from a partial view (e.g. a depth image) is available, a more common scenario in practice.
These observations motivate us to propose an unsupervised probabilistic 3D fitting approach that can cope with noisy and partial input scans. Specifically, we model the probability of the vertices of the input point cloud with a Gaussian Mixture Model (GMM) whose centroids are the vertices of the human template. In addition, we introduce a probability for the input vertices of being outliers. The network takes the point cloud as input and predicts the parameters of the template model. In the forward pass, the probabilistic correspondence association is updated as the posterior probability of the template model given the point cloud; from that we define a novel loss function which minimizes the discrepancy between the deformed template and the point cloud conditioned on the posterior probability. The network is trained with the proposed loss function, and the parameters of the template model are updated in the backward pass of the network.
The overall structure is shown in Fig. 1. To illustrate the updating process, in Fig. 1 we mark three vertices over the input scan and show the updated soft correspondence association using the corresponding color over the human model. While the training proceeds, the correspondence map gets more localised and finally we get the deformed human model that fits closely to the input point cloud.
The contributions of this paper are summarized as follows: 1) We propose a novel unsupervised method to reconstruct human shape and pose from point clouds. The network can be trained in a fully unsupervised manner with no need to warm up the network with supervised data; 2) We propose a probabilistic loss function derived from a GMM and encode the correspondences in the probability distribution, which is naturally differentiable. It works on both complete and incomplete point clouds and is robust to outliers; 3) We evaluate the proposed method on several public datasets and outperform the current state-of-the-art methods. We also demonstrate the effectiveness of our method on noisy point clouds captured in the real world.
Traditional approaches deal with the model fitting of 3D scans of articulated, highly non-planar objects like human bodies Bogo2014faust; bogo2017dynamicfaust or human hands romero2017embodiedMANO; wan2019self using nonrigid ICP based techniques besl1992icp; pons2017clothcap where the correspondences matching and parametric model fitting are conducted iteratively. To account for the large pose displacement, researchers proposed to use sliding correspondences li2008global or probabilistic correspondence association ye2014real; horaud2010rigid for more effective correspondence matching. However, they generally require a good model initialization, otherwise it will easily fall into local minimum. There are also approaches that focus on surface tracking yu2018doublefusion; pons2017clothcap over the 3D video sequence, in which case they usually have a predefined starting pose.
To automate the model fitting and eliminate the dependency on initialization, researchers take advantage of deep learning techniques and train a neural network to predict the parametric model from the input point cloud li2019lbsself; jiang2019skeleton; groueix20183dcoded. Recently, implicit functions bhatnagar2020IPnet; wang2021locallyPTF; chibane2020implicit were proposed, which are combined with the parametric model to restore human surface details. Another branch predicts dense correspondences between the template model and the input point cloud bhatnagar2020loopreg; marin2020farm, from which the parameters of the template model can be optimized. We review the related works on correspondence matching in the following subsection. In addition to model-based surface fitting, there are also efforts on general non-rigid surface registration by predicting the surface deformation feng2021recurrent, where the problem of topology changes is difficult to resolve.
Traditional methods for shape correspondence address the matching problem using handcrafted local shape descriptors aubry2011wave; bronstein2010scale; tombari2010unique, which are designed to remain invariant under a wide class of transformations the shape can undergo. There are also efforts on finding globally consistent sets of maps nguyen2011optimization; kim2011blended between shapes. Operating in a low-dimensional space spanned by the Laplace-Beltrami basis, Functional Maps ovsjanikov2012functional reduce the dimensionality of the problem drastically by converting point-level correspondence to function-level correspondence.
To secure reliable correspondences, learning-based methods were developed that train a neural network to predict dense correspondences wei2016dense; litany2017deepfunctional; ginzburg2020cyclic between the template model and the input point cloud. For instance, targeting human bodies, Wei et al. wei2016dense trained a feature descriptor on depth map pixels and treated correspondence matching as a body region classification problem. Recently, Deep Functional Maps marin2020farm; litany2017deepfunctional; roufosse2019unsupervised; ginzburg2020cyclic were developed to compute correspondences across 3D shapes while optimizing for global structural properties of the surface. Another widely used relaxation for matching problems eisenberger2020deep; solomon2016entropic; vestner2017product, Optimal Transport, relies on large-scale dense matrices (e.g. geodesic distances or heat kernels) for surface matching. More recently, a self-supervised method called LoopReg bhatnagar2020loopreg was proposed with a differentiable registration loop to predict correspondences and register template models to the input point cloud.
Although the learning based methods have been widely explored on model fitting as well as correspondence matching, they usually assume clean input point cloud and compute one-to-one correspondence. But the assumption can not hold for real captured data when the input point cloud is quite noisy with outliers. On the other hand, a large dataset with groundtruth human model annotations is also required to train the network. On the contrary, our proposed approach can work in a fully unsupervised manner and is also robust to outliers with our probabilistic correspondence association.
To reconstruct human surface from point cloud, we follow the template based surface fitting strategy. Specifically, we have the parametric human model and our goal is to optimize the parameters in the human model so that the deformed human model can match the input point cloud.
The traditional approaches solve the problem iteratively through correspondence search and model fitting. Mathematically, the human model is reconstructed by minimizing the following objective function

$$E(\theta) = \sum_{n=1}^{N} \mathrm{dist}\big(v_n, c_n\big), \tag{1}$$

where $c_n$ denotes the correspondence in the human model for the vertex $v_n$ from the input point cloud, usually computed via nearest-neighbor search, and the human model $\mathcal{M}(\theta)$ is updated by minimizing the distance function $\mathrm{dist}(\cdot,\cdot)$, which can be a point-to-point or point-to-surface distance between the correspondences. However, the optimization tends to fall into local minima, especially when the initial model is far away from the input.
Generally, the above distance metrics or the Chamfer loss are also used as an unsupervised loss to train the network for model registration bhatnagar2020loopreg; li2019lbsself. However, similar to the local minima in traditional model fitting approaches, it is hard for the network to converge if it is trained with such a distance loss without using supervised data to warm up the network. In addition, this distance function is sensitive to outliers. When we consider the correspondence association, it is not appropriate to assign correspondences to vertices that are outliers, whereas previous methods usually predict correspondences for every vertex of the input point cloud.
In this paper, instead of explicitly assigning correspondences, we propose a probabilistic correspondence association module and define a novel distance function as the unsupervised fitting loss, which is differentiable and robust to outliers. The proposed loss function allows us to train the network from scratch. Next, we describe the derivation of our loss function in detail.
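To make the outlier sensitivity concrete, the following is a minimal NumPy sketch of one common form of the Chamfer distance (symmetric, squared distances, mean-reduced). The function name and this particular variant are assumptions for illustration; the exact variant used by the compared methods is not specified here.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Illustrative variant: mean squared nearest-neighbor distance in both
    directions. Other variants use unsquared distances or sums.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Nearest neighbor in b for each point of a, and vice versa.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Two identical sets give zero; one far-away outlier dominates the value.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
scan = np.vstack([template, [[10.0, 0.0, 0.0]]])
print(chamfer_distance(template, template))  # 0.0
print(chamfer_distance(template, scan))      # 27.0
```

In this toy case the two surfaces agree exactly, yet a single outlier contributes the entire loss, because every input point is forced to take a hard nearest-neighbor correspondence.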
Our loss function builds upon the GMM: the input point cloud is assumed to be generated by a GMM whose centroids are the vertices of the deformed template model. To further compensate for outliers in the input point cloud, we use a uniform distribution to define the probability of a vertex being an outlier. Mathematically, the probability of each vertex $v_n$ of the input scan can be expressed as

$$p(v_n) = (1-\mu)\sum_{m=1}^{M} P(m)\,p(v_n \mid m) + \mu\,\frac{1}{N}, \tag{2}$$

where $N$ and $M$ are the numbers of vertices of the input point cloud and the template human model, respectively. $p(v_n \mid m)$ is a normal distribution and $\mu$ is the approximate percentage of outliers, which are assumed to be uniformly distributed. $P(m)$ represents the coefficients of the mixture model, which can be viewed as the probability of assigning the vertex $v_n$ in the input point cloud and the vertex $m$ on the human model as a correspondence.
We use equal isotropic covariances for the Gaussian model and therefore the conditional distribution is formulated as

$$p(v_n \mid m) = \frac{1}{(2\pi\sigma^2)^{3/2}}\exp\left(-\frac{\lVert v_n - (R\,\mathcal{M}(\theta)_m + T)\rVert^2}{2\sigma^2}\right). \tag{3}$$

In the above conditional probability, $\mathcal{M}(\theta)$ is the parametric human model with parameters $\theta$, and $\mathcal{M}(\theta)_m$ is its $m$-th vertex. The proposed technique applies generally to any 3D statistical model; in this paper we use the SMPL human model, which is parameterized by shape and pose coefficients. $R$ and $T$ are the global rotation matrix and translation vector that transform the human model to the input point cloud. More details about the SMPL model can be found in loper2015smpl.
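Under the i.i.d. assumption, the energy function to be minimized is the negative log-likelihood over all the vertices of the input point cloud,

$$E(\theta, \sigma^2) = -\sum_{n=1}^{N}\log p(v_n), \tag{4}$$

where $p(v_n)$ is the mixture probability of the input vertex $v_n$ (Eq. 2).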
Inspired by EM based optimization procedure to solve the above energy function iteratively EM77, we design the network with a probability association module in the forward pass to update the posterior distribution and resolve the human model parameters in the backward training.
Probabilistic Correspondence Association. Similar to the E-step of the EM optimization, given the parameters predicted by the network from the previous iteration, we update the posterior distribution of the vertices on the human model conditioned on the input point cloud by

$$P^{(t)}(m \mid v_n) = \frac{(1-\mu)\,P(m)\,p^{(t-1)}(v_n \mid m)}{(1-\mu)\sum_{k=1}^{M} P(k)\,p^{(t-1)}(v_n \mid k) + \mu/N}, \tag{5}$$

where $p^{(t-1)}(v_n \mid m)$ is the Gaussian component (Eq. 3) evaluated with the parameters from the previous iteration, $\mu$ is the outlier ratio, and $t$ denotes the training iteration. In our current scenario, the posterior distribution computation corresponds to the correspondence matching between the template model and the input point cloud. Compared with LoopReg bhatnagar2020loopreg, which uses a diffusion field to make the correspondence matching operation differentiable, our probabilistic correspondence association is naturally differentiable without extra effort.
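The posterior computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the released implementation: it assumes equal mixture coefficients $1/M$ (as in Coherent Point Drift), in which case the Gaussian normalizers cancel and the uniform outlier component reduces to a single additive constant `c` in the denominator. The function name and argument names are hypothetical.

```python
import numpy as np

def posterior(points, template, sigma2, mu):
    """Soft correspondences P[n, m] between input points and template vertices.

    points:   (N, 3) input point cloud
    template: (M, 3) vertices of the deformed template (already posed)
    sigma2:   isotropic variance shared by all Gaussian components
    mu:       assumed outlier ratio, 0 < mu < 1
    Assumes equal mixture coefficients 1/M (an assumption of this sketch).
    """
    n, m = len(points), len(template)
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((points[:, None, :] - template[None, :, :]) ** 2, axis=-1)
    gauss = np.exp(-d2 / (2.0 * sigma2))
    # Contribution of the uniform outlier component after normalization.
    c = (2.0 * np.pi * sigma2) ** 1.5 * mu / (1.0 - mu) * m / n
    return gauss / (gauss.sum(axis=1, keepdims=True) + c)

template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [50.0, 50.0, 50.0]])
P = posterior(points, template, sigma2=0.01, mu=0.1)
print(P.sum(axis=1))  # close to 1 for the two inliers, close to 0 for the outlier
```

Each row sums to at most one; the missing mass is the probability that the point is an outlier, so far-away points receive almost no correspondence weight instead of being forced onto the nearest vertex.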
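Unsupervised Loss. Substituting the computed posterior distribution into Eq. 4, we can then update the human model parameters by minimizing the following complete negative log-likelihood function:

$$Q(\theta, \sigma^2) = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\sum_{m=1}^{M} P^{(t)}(m \mid v_n)\,\lVert v_n - (R\,\mathcal{M}(\theta)_m + T)\rVert^2 + \frac{3N_P}{2}\log\sigma^2, \tag{6}$$

where $N_P = \sum_{n}\sum_{m} P^{(t)}(m \mid v_n)$ and terms independent of $\theta$ and $\sigma^2$ are omitted.
In Eq. 6, in the typical EM optimization process, $\theta$ and $\sigma^2$ are supposed to be updated in the M-step. In our case, taking the point cloud as input, the network predicts the parameters $\theta$ of the human model. However, $\sigma^2$ is not controlled directly by the input point cloud; instead, it is treated as a hyper-parameter in the loss function, and we develop a strategy to update it during training. Therefore, neglecting the second term in Eq. 6, which does not depend on the model parameters, we define our unsupervised loss function as

$$L_{unsup} = \frac{1}{2\sigma^2}\sum_{n=1}^{N}\sum_{m=1}^{M} P^{(t)}(m \mid v_n)\,\lVert v_n - (R\,\mathcal{M}(\theta)_m + T)\rVert^2. \tag{7}$$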
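As a rough illustration of the resulting loss, the sketch below evaluates the posterior-weighted squared distance between the input points and the template vertices. The names are hypothetical, the template is assumed to be already deformed and globally transformed, and a real training setup would use an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np

def soft_fitting_loss(points, template, post, sigma2):
    """Posterior-weighted squared distance, scaled by 1 / (2 * sigma2).

    points:   (N, 3) input point cloud
    template: (M, 3) deformed template vertices
    post:     (N, M) posterior weights from the association step
    """
    d2 = np.sum((points[:, None, :] - template[None, :, :]) ** 2, axis=-1)
    return np.sum(post * d2) / (2.0 * sigma2)

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
template = points + np.array([0.1, 0.0, 0.0])  # template shifted by 0.1 along x
post = np.eye(2)                               # hard one-to-one association
print(soft_fitting_loss(points, template, post, sigma2=0.5))  # approximately 0.02
```

With a one-hot posterior the loss reduces to a sum of squared distances between matched pairs; with a soft posterior, each input point pulls on several nearby template vertices at once.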
To prevent arbitrary poses and shapes, our training loss also includes three regularization terms on the predicted human shape and pose parameters bogo2016keepsmplify; kolotouros2019SPIN: a mixture-of-Gaussians pose prior trained with shapes fitted on marker data loper2014mosh, a pose prior penalizing unnatural rotations of elbows and knees, and a quadratic penalty on the shape coefficients.
Finally, the overall loss function used to train the network (Eq. 9) is the weighted combination of the unsupervised loss and these regularization terms.
Update of $\sigma^2$. In Eq. 3, the hyper-parameter $\sigma^2$ is supposed to be large at the beginning, since the initial human model is far from the input point cloud and the matching uncertainty is high. $\sigma^2$ gets smaller as the iterations proceed. When $\sigma^2 \to 0$, the posterior distribution (Eq. 5) will be close to a one-hot vector and our proposed loss function will approximate the classical Chamfer distance. In this way, our proposed loss function can be seen as a generalized Chamfer distance with soft correspondences.
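In the typical EM optimization process, $\sigma^2$ is updated via the following equation (Eq. 11), which is derived by setting the derivative of the complete negative log-likelihood (Eq. 6) with respect to $\sigma^2$ to zero,

$$\sigma^2 = \frac{1}{3N_P}\sum_{n=1}^{N}\sum_{m=1}^{M} P^{(t)}(m \mid v_n)\,\lVert v_n - (R\,\mathcal{M}(\theta)_m + T)\rVert^2, \tag{11}$$

where $N_P = \sum_{n}\sum_{m} P^{(t)}(m \mid v_n)$.
As shown in the above equation, the update of $\sigma^2$ relies on the posterior probability of the previous iteration, which requires extra effort to store and maintain. In addition, different from traditional EM optimization, $\sigma^2$ is also affected by the current status of the network. Therefore, we re-compute the posterior probability matrix with a $\sigma^2$ that decreases with the training epoch, and then update the current $\sigma^2$ via Eq. 11.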
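The closed-form variance update from the M-step of EM can be sketched as below. This is the standard update for an isotropic GMM in 3D, shown only for illustration: the decreasing schedule used to re-compute the posterior beforehand is not specified in this section and is therefore left out, and the function name is hypothetical.

```python
import numpy as np

def update_sigma2(points, template, post):
    """Closed-form isotropic variance update for 3D points.

    sigma2 = sum(post * squared_distance) / (3 * sum(post)),
    i.e. the posterior-weighted mean squared residual per coordinate.
    """
    d2 = np.sum((points[:, None, :] - template[None, :, :]) ** 2, axis=-1)
    n_p = post.sum()  # effective number of inlier associations
    return np.sum(post * d2) / (3.0 * n_p)

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
template = points + np.array([0.1, 0.0, 0.0])
post = np.eye(2)
print(update_sigma2(points, template, post))  # approximately 0.00333
```

As the template moves closer to the scan the weighted residual shrinks, so the estimated variance decreases and the soft correspondences become more localized.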
Global Pose Estimation.
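Our network predicts the human model parameters, which include shape, local pose and global pose parameters. For global pose estimation, it is hard to train a neural network to directly predict the rotation matrix. Instead of using Euler angles to represent the global rotation bhatnagar2020IPnet; bhatnagar2020loopreg, in this paper we employ the 6D rotation representation, which has been proved to be continuous in real Euclidean spaces and more suitable for learning zhou2019continuity; xu2019disn. Mathematically, we use the vector $\mathbf{x} = [\mathbf{a}_1, \mathbf{a}_2]$, $\mathbf{a}_1, \mathbf{a}_2 \in \mathbb{R}^3$, to represent the rotation, from which the rotation matrix can be obtained by

$$R = [\mathbf{b}_1 \;\; \mathbf{b}_2 \;\; \mathbf{b}_3], \qquad \mathbf{b}_1 = \phi(\mathbf{a}_1), \quad \mathbf{b}_2 = \phi\big(\mathbf{a}_2 - (\mathbf{b}_1 \cdot \mathbf{a}_2)\,\mathbf{b}_1\big), \quad \mathbf{b}_3 = \mathbf{b}_1 \times \mathbf{b}_2,$$

where $\phi(\cdot)$ is a normalization function and "$\times$" means cross product.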
In the experimental section, we validate the effectiveness of using 6D representation in our framework.
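For illustration, the mapping from a 6D vector to a rotation matrix via Gram-Schmidt orthogonalization, as described in zhou2019continuity, can be sketched as follows. The function name is hypothetical and the column-wise stacking of the basis vectors is a convention chosen for this sketch.

```python
import numpy as np

def rot6d_to_matrix(x):
    """Convert a 6D rotation representation to a 3x3 rotation matrix.

    x: array of shape (6,), interpreted as two 3D vectors a1 and a2.
    The two vectors must be non-zero and not parallel.
    """
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)                 # normalize the first vector
    u2 = a2 - np.dot(b1, a2) * b1                # remove the component along b1
    b2 = u2 / np.linalg.norm(u2)
    b3 = np.cross(b1, b2)                        # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)        # columns are b1, b2, b3

R = rot6d_to_matrix(np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
print(R)  # identity matrix
```

Any network output with non-degenerate `a1` and `a2` maps to a valid rotation (orthonormal, determinant one), so no additional orthogonality constraint is needed during training.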
Network Structure. Our network takes the point cloud as input and we use PointNet++ qi2017pointnetplusplus to regress the parameters of the SMPL model. The network is trained from scratch and in the forward pass the probabilistic correspondences association is updated by computing the posterior distribution of the input point cloud given the current predicted human model, and the network is trained with our proposed unsupervised loss to update the human model parameters via back-propagation.
Compatible with Complete and Incomplete Point Clouds. Previous methods for unsupervised human reconstruction or correspondence matching are usually designed to work with relatively complete point clouds bhatnagar2020loopreg; wang2021locallyPTF; li2019lbsself. To deal with incomplete point clouds, especially those acquired from a depth map where a large portion of the surface is invisible, existing works rely on large human datasets with strong supervision bhatnagar2020IPnet. In contrast, our proposed unsupervised method naturally works with incomplete point clouds, because our implicit correspondence association does not need to predict one-to-one correspondences between the template model and the input point cloud. We demonstrate the effectiveness of our method on human shape and pose reconstruction from a depth map in the experiments.
Implementation details. Our network takes a point cloud of 2048 vertices as input, which is sampled from the 3D human scans with farthest point sampling. In our loss function (Eq. 9), the three weights are set to 20.0, 225.0 and 25.0, respectively. The Gaussian hyper-parameter that is updated during training is initialized as 0.1. We further improve the reconstruction with an instance-level optimization. Specifically, starting from the parametric model predicted by our network, we minimize the objective function defined in Eq. 3 with EM optimization.
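Farthest point sampling, used above to obtain the 2048 input vertices, can be sketched with the usual greedy procedure: repeatedly pick the point that is farthest from everything selected so far. The function name and the fixed starting index are assumptions of this sketch; practical pipelines often start from a random point and run on the GPU.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy farthest point sampling.

    points: (N, 3) array; k: number of samples (k <= N); start: first index.
    Returns the indices of the k selected points.
    """
    selected = np.empty(k, dtype=int)
    selected[0] = start
    # Squared distance from every point to its nearest selected point.
    min_d2 = np.sum((points - points[start]) ** 2, axis=1)
    for i in range(1, k):
        selected[i] = np.argmax(min_d2)
        d2_new = np.sum((points - points[selected[i]]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2_new)
    return selected

# Eleven points on a line: the sampler picks the two ends, then the middle.
line = np.stack([np.arange(11.0), np.zeros(11), np.zeros(11)], axis=1)
print(farthest_point_sampling(line, 3))  # [ 0 10  5]
```

Compared with uniform random sampling, this spreads the samples evenly over the surface, which helps thin parts such as hands and feet remain represented.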
In this paper, we have considered three public datasets on human modeling: CAPE, FAUST and CMU Panoptic PointCloud Dataset.
The CAPE dataset ma2020cape provides 148,584 pairs of scans under clothing and registered ground truth body shapes for 15 subjects of different genders. The FAUST dataset Bogo2014faust consists of 100 training and 200 testing scans. They may include noise and have holes, typically missing part of the feet. In this paper, we have not used the training set to train our network, instead we only evaluated the correspondence matching results on the test set and made comparison with other methods. The CMU Panoptic PointCloud Dataset joo2017panoptic is captured by 10 Synchronized Kinects. Compared with other 3D point cloud datasets, the captured human surface is quite noisy and contains large amount of vertices which do not belong to the human surface. For example, there are sequences of human subject playing musical instruments. Among the captured sequences, we choose the sequences containing single human subject that have ground-truth 3D joints for testing.
We have split the CAPE dataset into training and testing set. Specifically, we use 12 subjects for training and 3 subjects for testing. We also subsampled the recordings by a factor of 5. The final training set consists of 26,004 frames while the validation set consists of 3,965 frames. We generated input point clouds by sampling 2,048 points on the surfaces of the clothed meshes and add Gaussian noise of zero mean and 1mm standard-deviation. In this paper our network was trained on the CAPE training set but evaluated directly on other datasets without further fine-tuning.
Comparison Methods. We compared with three state-of-the-art methods on human model reconstruction from point cloud, namely 3D-CODED groueix20183dcoded, IPNet bhatnagar2020IPnet and PTF wang2021locallyPTF, all of which are supervised approaches trained with groundtruth annotation. Both IPNet and PTF exploited implicit representations and SMPL based parametric model for surface fitting. 3D-CODED also adopted the SMPL template and directly regressed the deformation and predicted the deformed template model. For fair comparison, we compute the error of the predicted SMPL model. Besides, the 3D-CODED method cannot apply on the point cloud since they require a 3D mesh as input. Therefore, 3D-CODED method was not evaluated on CMU Panoptic PointCloud Dataset.
Evaluation Metrics. First, the vertex-to-vertex (V2V) error is computed as the distance between the vertex on the predicted SMPL model and the corresponding vertex on the groundtruth. Besides, we also report Chamfer distance (CD) which is a common evaluation indicator for surface registration.
In Tab. 1, we demonstrate the evaluation results of 3D-CODED, IPNet, PTF and ours.
For IPNet and 3D-CODED, we use the trained networks released by the authors. Since we share the same training and testing split with PTF, we refer to their paper for the evaluated results. To validate the effectiveness of our method on incomplete point clouds, we generated depth maps for the models in the CAPE dataset by rendering the human mesh from randomly generated camera viewpoints rotating around the human subject. As shown in Tab. 1, our proposed method even outperforms the existing supervised approaches, with the smallest V2V and CD errors for both complete point clouds and incomplete point clouds (indicated as Depth). Since PTF and 3D-CODED cannot be directly extended to incomplete point clouds, the corresponding errors are not reported in the table. In addition to the quantitative evaluation, we also visualize some sampled results in Fig. 2 and Fig. 3. More visualization results can be found in the supplementary.
Evaluation Metrics. For the CMU Panoptic dataset, we do not have ground-truth SMPL models for the captured point clouds. Instead, ground-truth 3D joints are provided. Therefore, we use the Mean Per Joint Position Error (MPJPE) as the evaluation protocol, computed as the average Euclidean distance between the predicted joints and the ground truth. In addition, the percentage of correct keypoints (PCK) is also reported, where a computed joint is considered correct if its distance to the ground truth is within a certain threshold (100 mm in this paper).
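The two joint-based metrics can be computed as in the following NumPy sketch. The function names are hypothetical, joints are assumed to be given in millimetres, and "within the threshold" is taken to include the boundary.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: mean Euclidean distance per joint."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, threshold=100.0):
    """Fraction of joints whose error is within the threshold (same units)."""
    return (np.linalg.norm(pred - gt, axis=-1) <= threshold).mean()

gt = np.zeros((4, 3))
pred = np.array([[0.0, 0.0, 0.0],      # error 0 mm
                 [30.0, 40.0, 0.0],    # error 50 mm
                 [0.0, 0.0, 150.0],    # error 150 mm, counted as incorrect
                 [0.0, 0.0, 10.0]])    # error 10 mm
print(mpjpe(pred, gt))  # 52.5
print(pck(pred, gt))    # 0.75
```

MPJPE is an error (lower is better), whereas PCK is an accuracy (higher is better), so the two should be read in opposite directions when comparing methods.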
|MPJPE (mm)||PCK||MPJPE (mm)||PCK|
In Tab. 2, we show the error of the joints extracted from the predicted SMPL model, reported on both complete and incomplete point clouds. For the complete point cloud we used the point cloud merged from all 10 Kinects, and for the incomplete point cloud we used a depth map randomly selected from those Kinects. As shown in Tab. 2, the proposed method achieves better performance on both MPJPE and PCK. Some sampled results are displayed in Fig. 4, from which we can see that the proposed method is robust to noisy input point clouds that contain many outliers. More visualization results can be found in the supplementary.
Comparison Methods. We have compared with several existing approaches on correspondence matching. Specifically, FMNet litany2017deepfunctional and LoopReg bhatnagar2020loopreg can directly regress the correspondences while 3D-CODED groueix20183dcoded, LBS-AE li2019lbsself as well as our proposed approach use the predicted template model as the bridge to compute the correspondences between different scans. Among those methods, FMNet and 3D-CODED are supervised approaches while LoopReg and LBS-AE are unsupervised ones, but they still rely on supervised data to warm-up the training.
|Metrics||FMNet litany2017deepfunctional||LBS-AE li2019lbsself||3D-CODED groueix20183dcoded||LoopReg bhatnagar2020loopreg||Ours|
Evaluation on FAUST Dataset. We have compared with previous correspondence matching approaches on the FAUST dataset Bogo2014faust, with the results shown in Tab. 3. We follow the evaluation metric of bhatnagar2020loopreg and report the correspondence error as the Euclidean distance between the predicted correspondence and the ground truth for both the Inter-class and Intra-class cases. As shown in Tab. 3, our proposed method achieves the best performance in both cases, and the matching error is reduced by a large margin, especially for the Intra-class scenario.
In this section, we conducted ablation studies on several key components of the proposed method. We implemented the following ablation studies on the CAPE dataset.
|Noise||Standard Deviation (mm)|
|IPNet bhatnagar2020IPnet V2V(mm)||28.2||57.4||62.3||67.5||73.8|
|IPNet bhatnagar2020IPnet CD(mm)||15.1||31.8||33.2||38.4||40.2|
Robustness to Noise. We have added both Gaussian noise and random outliers for the models in the CAPE test set. In Tab. 4, we can see that while increasing the standard deviation of the Gaussian noise and the percentage of random outliers, compared with IPNet bhatnagar2020IPnet that has not considered the noise, the V2V and Chamfer errors of the predicted human surface from our proposed method had a rather small increase. For example, after adding Gaussian noise with 5mm standard deviation and percentage of outliers, the V2V and Chamfer errors of our predicted human model only increased by 1.1mm and 1.3mm respectively. On the contrary, the errors of IPNet have increased greatly.
6D Pose Estimation. In Tab. 5, we demonstrate the effectiveness of using 6D representation in pose estimation. As compared with the widely used Euler angle or axis-angle representation, we have achieved better results with smaller surface error on the reconstructed human model using 6D representation zhou2019continuity.
|Ours without 6D vector||26.1||14.7||57.2||38.9|
|Ours with 6D vector||21.8||13.2||48.6||33.2|
Update of $\sigma^2$. In Fig. 5, we show sampled comparison results of the reconstructed human model with and without updating $\sigma^2$ during training. Trained with a small $\sigma^2$, the network behaves similarly to one trained with the Chamfer distance, where it is difficult for the training to converge. On the other hand, as demonstrated in Fig. 5, with a large $\sigma^2$ the network lacks the ability to fit the input scan precisely. In contrast, with our proposed updating strategy, we obtain a better fit to the input point cloud.
In this paper, a novel unsupervised approach was proposed to reconstruct human shape and pose from an input point cloud. Instead of explicitly predicting or regressing the correspondences for each vertex of the point cloud, which has no tolerance to outliers, we adopted a GMM to model the input point cloud and implicitly encoded the correspondences with a probabilistic association. A novel loss function was proposed to train the network from scratch in a fully unsupervised manner, which is also robust to a significant number of outliers. We conducted evaluations on several public datasets and outperformed both supervised and unsupervised state-of-the-art methods. As a limitation, we have not explicitly addressed the collision problem in this paper, which makes it difficult to precisely reconstruct the human surface when the body has close self-interaction. As future work, we could include body part information to resolve this issue.