1 Introduction
Garment reconstruction is a crucial technique in many applications, e.g. virtual try-on santesteban2019learning, VR/AR volino2007virtual and visual effects spielmann2013set. Extensive efforts varol2018bodynet; jackson20183d; pifuSHNMKL19; alldieck2019learning; alldieck2018detailed; zheng2019deephuman; natsume2019siclope; tang2019neural; gabeur2019moulding; smith2019facsimile; corona2021smplicit; bhatnagar2020combining have been put into reconstructing the human body and garments as a whole with the help of implicit or volumetric representations.
However, a controllable garment model is desirable in many applications. In this work, we focus on the parametric modeling of 3D garments, which has two advantages. Firstly, garments can be separated from the human body. Secondly, the topology of the reconstructed meshes can be controlled, which enables downstream tasks that require high interpretability.
Different from previous parametric methods for garment reconstruction alldieck2019learning; alldieck2018detailed; bhatnagar2019multi; jiang2020bcnet that take 2D RGB images as input, we approach the problem from a 3D perspective, specifically point cloud sequences (shown in Fig. LABEL:fig:teaser), for the following three reasons. Firstly, 3D inputs eliminate the scale and pose ambiguities that are difficult to avoid when using 2D images. Secondly, exploiting temporal information is important for capturing garment dynamics, for which there have been few attempts. Thirdly, recent developments in 3D sensors (e.g. LiDAR) have reduced the cost and difficulty of obtaining point clouds, which makes it easier to leverage 3D point clouds for research problems and commercial applications. However, garment reconstruction from point cloud sequences has not been properly explored by previous works. We contribute an early attempt at this meaningful task. Our advantages over other garment reconstruction methods are listed in Tab. LABEL:tab:teaser.
The proposed garment reconstruction framework, Garment4D, consists of three major parts: a) sequential garment registration, b) canonical garment estimation and c) posed garment reconstruction. In the registration part, we have several sequences of garment meshes whose topology differs across sequences but is shared within each sequence. For each type of garment (e.g. T-shirt, trousers, skirt), we use an optimization-based method to register one frame from each sequence to a template mesh. Then, inside each sequence, a barycentric interpolation method is used to remesh the other frames so that their topology matches the template mesh. Thereafter, as the first step of garment reconstruction, following the practice of previous parametric methods
bhatnagar2019multi; jiang2020bcnet, we estimate the canonical garment mesh for each sequence with a semantic-aware garment PCA coefficient encoder, which takes the dressed human point cloud sequences as input. The posed garment reconstruction part is where the challenges arise, for two reasons. Firstly, due to the unordered and unstructured nature of point clouds, it is non-trivial to learn low-level geometric features directly. Secondly, it is challenging to capture the garment dynamics caused by interactions between the human body and garments. In particular, it is difficult to model the non-rigid deformation of loose garments (e.g. skirts), which depends on both the current human pose and previous human motions. To address these challenges, we first apply Interpolated Linear Blend Skinning (LBS) to the estimated canonical garments to obtain proposals. Unlike alldieck2019learning; alldieck2018detailed; bhatnagar2019multi, which depend on the SMPL+D model to skin the garment, the proposed Interpolated LBS can skin loose garments without artifacts. Moreover, our method does not require learning, which distinguishes it from jiang2020bcnet. After obtaining the proposals, we present the Proposal-Guided Hierarchical Feature Network along with the Iterative Graph Convolution Network (GCN) for efficient geometric feature learning. Meanwhile, a Temporal Transformer is utilized for temporal fusion to capture smooth garment dynamics.
Our contributions can be summarised as:
1) We propose a practical pipeline for the reconstruction of various garments (including challenging loose garments) that are driven by large human movements, which does not rely on any particular human body parametric model or extra annotations other than human and garment meshes.
2) Several novel learning modules, including the Proposal-Guided Hierarchical Feature Network, Iterative GCN, and Temporal Transformer, are proposed to effectively learn spatio-temporal features from 3D point cloud sequences for garment reconstruction.
3) We establish a comprehensive dataset consisting of 3D point cloud sequences by adapting CLOTH3D bertiche2020cloth3d, which serves as the first benchmark dataset for evaluating posed garment reconstruction based on point cloud sequences.
In addition, this work presents an early attempt at exploiting temporal information for capturing garment motions. To the best of our knowledge, we are the first to explore garment reconstruction from point cloud sequences.
2 Our Approach
We introduce our approach, Garment4D, in this section. Firstly, our garment model is formulated in Subsection 2.1. Afterwards, we elaborate on the three major steps of Garment4D: sequential garment registration (Subsection 2.2), canonical garment estimation (Subsection 2.3), and posed garment reconstruction (Subsection 2.4). Training loss functions are introduced in Subsection 2.5. An overview of the proposed Garment4D is shown in Fig. 2.

2.1 3D Garment Model
We design the garment model by the following three parts:
Canonical Garment Model. Formally, the canonical garment can be formulated as $T_G(\boldsymbol{\alpha}) = \bar{T}_G + \mathbf{S}\boldsymbol{\alpha}$, where $\bar{T}_G$ represents the mean shape of the current type of garment, $\mathbf{S}$ is the matrix of PCA components that form the garment shape subspace, and $\boldsymbol{\alpha}$ represents the PCA coefficients of the corresponding garment.
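As a minimal sketch of this linear model (array shapes and the helper name are illustrative, not the paper's implementation), the canonical garment is simply the mean shape plus a coefficient-weighted sum of PCA components:

```python
import numpy as np

def canonical_garment(mean_shape, pca_components, coeffs):
    """Reconstruct a canonical garment mesh from PCA coefficients.

    mean_shape:     (V, 3) mean vertex positions of this garment type
    pca_components: (C, V, 3) principal components of the shape subspace
    coeffs:         (C,) PCA coefficients predicted by the network
    """
    # Linear model: mean shape plus a weighted sum of components.
    return mean_shape + np.tensordot(coeffs, pca_components, axes=1)

# Toy example: 4 vertices, 2 components (illustrative numbers only).
V, C = 4, 2
mean = np.zeros((V, 3))
comps = np.stack([np.ones((V, 3)), 2 * np.ones((V, 3))])
verts = canonical_garment(mean, comps, np.array([1.0, 0.5]))
```

In practice the components would come from a PCA fit over all registered garments of the same type, which is exactly why the registration step in Subsection 2.2 must first unify mesh topology.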
Interpolated LBS. One critical problem in the skinning step is to obtain a set of skinning weights for garment vertices that provides reasonable proposals of posed garments. Previous methods alldieck2019learning; alldieck2018detailed; bhatnagar2019multi approximate the garment skinning weights using the weights of the closest human mesh vertex. However, these methods can produce mesh artifacts, and hence limit the quality of reconstructed garments, for the following two reasons. Firstly, the resolution of our garment meshes is higher than that of the human meshes. Secondly, loose garments like skirts do not always share the same topology as the human body. To avoid artifacts, we propose to interpolate the skinning weights from the $K$ nearest human mesh vertices. Laplacian smoothing is also performed to further ensure the smoothness of the skinning weights.
Displacement Prediction. For challenging loose garments like skirts and dresses, the interpolated results are mostly unsatisfactory. To this end, we further refine the results of the interpolated LBS by predicting displacements $D$. In general, the whole garment model can be formulated as

$G(\boldsymbol{\alpha}, \boldsymbol{\theta}) = W\left(T_G(\boldsymbol{\alpha}) + D,\; J,\; \boldsymbol{\theta},\; \mathcal{W}_G\right), \quad (1)$

$\mathcal{W}_G = \sum_{k=1}^{K} w_k \mathcal{W}_H^{(k)}, \quad (2)$

where $\boldsymbol{\theta}$ represents the axis-angle of each joint, $\mathcal{W}_H$ and $\mathcal{W}_G$ are the skinning weights of the human and garment mesh vertices, $J$ is the joint locations of the human body, $\mathcal{W}_H^{(k)}$ is the skinning weight of the $k$-th nearest body vertex, $w_k$ is the inverse distance weight used for interpolation, and $W(\cdot)$ is the standard linear blend skinning function.
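The weight interpolation step can be sketched as follows, assuming a brute-force nearest-neighbor search (the function name and default $K$ are illustrative, and the paper's additional Laplacian smoothing is omitted):

```python
import numpy as np

def interpolate_skinning_weights(garment_verts, body_verts, body_weights,
                                 k=4, eps=1e-8):
    """Interpolate garment skinning weights from the K nearest body vertices.

    garment_verts: (Vg, 3), body_verts: (Vb, 3)
    body_weights:  (Vb, J) LBS weights of the human mesh
    Returns (Vg, J) garment weights, rows summing to 1.
    """
    # Pairwise distances from every garment vertex to every body vertex.
    d = np.linalg.norm(garment_verts[:, None, :] - body_verts[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]          # K nearest body vertices
    nd = np.take_along_axis(d, idx, axis=1)     # their distances
    w = 1.0 / (nd + eps)                        # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)        # normalize per garment vertex
    # Convex combination of the neighbors' skinning weights.
    return np.einsum('gk,gkj->gj', w, body_weights[idx])
```

Because each garment vertex receives a convex combination of several body vertices' weights, the hard jumps that cause cliff-like artifacts in nearest-vertex copying are smoothed out.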
2.2 Garment Registration
In order to perform PCA decomposition on the same type of garments, we unify the mesh topology by registering the raw meshes to a template mesh. A visualization of the registration process is illustrated in Fig. 2 (a).
Considering that source meshes have large shape variance and diverse sleeve/hem lengths, we use a boundary-aware optimization method to align the source meshes to the target mesh. The registration loss function can be formulated as

$\mathcal{L}_{reg} = \lambda_{CD}\mathcal{L}_{CD} + \lambda_{E}\mathcal{L}_{E} + \lambda_{N}\mathcal{L}_{N} + \lambda_{B}\mathcal{L}_{B}, \quad (3)$

where $\mathcal{L}_{CD}$ is the Chamfer Distance (CD) between source and target mesh vertices, $\mathcal{L}_{E}$ regularizes the edge lengths of the source mesh, $\mathcal{L}_{N}$ maintains the normal consistency between neighbouring faces, $\mathcal{L}_{B}$ minimizes the CD between corresponding boundaries of the source and target meshes (e.g. cuffs and neckline), and $\lambda_{CD}$, $\lambda_{E}$, $\lambda_{N}$, $\lambda_{B}$ are the corresponding weights for each loss term. The $\mathcal{L}_{B}$ term makes the optimization process aware of mesh boundaries, which is essential for aligning garments with large shape variance.
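For reference, the symmetric Chamfer Distance term in this loss can be sketched with a brute-force pairwise computation (illustrative only; practical implementations use accelerated nearest-neighbor search and differentiable GPU kernels):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()          # both directions
```

The boundary term uses the same computation restricted to corresponding boundary loops (cuffs, neckline), which is what anchors sleeves and hems during optimization.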
For each sequence, the aligned garment mesh is then used to calculate barycentric interpolation weights against the vertices of the template mesh, which are further used to remesh the garment meshes in the other frames. Specifically, for each vertex $v$ of the template mesh, the nearest face $f$ in the aligned mesh is found. We project $v$ onto $f$ and then calculate the barycentric coordinates of the projected point on $f$. Using these barycentric coordinates, it is straightforward to remesh all the other frames of garment meshes to the topology of the template mesh.
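The per-vertex projection and barycentric coordinate computation is a standard construction, which can be sketched as (the function name is illustrative):

```python
import numpy as np

def barycentric_on_face(p, a, b, c):
    """Project point p onto the plane of triangle (a, b, c) and return the
    barycentric coordinates (u, v, w) of the projection, so that the
    projected point equals u*a + v*b + w*c."""
    ab, ac = b - a, c - a
    n = np.cross(ab, ac)
    # Project p onto the triangle's plane.
    q = p - n * np.dot(p - a, n) / np.dot(n, n)
    # Solve the 2x2 system for coordinates in the (ab, ac) basis.
    d00, d01, d11 = ab @ ab, ab @ ac, ac @ ac
    d20, d21 = (q - a) @ ab, (q - a) @ ac
    den = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / den
    w = (d00 * d21 - d01 * d20) / den
    return 1.0 - v - w, v, w
```

Once (u, v, w) are stored for the nearest face of each template vertex, remeshing any other frame amounts to re-evaluating u*a + v*b + w*c with that frame's vertex positions.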
2.3 Canonical Garment Estimation
To estimate the canonical garments, we predict the PCA coefficients $\boldsymbol{\alpha}$ from the segmented 3D points of garments in the input point cloud sequences $\{P_t\}_{t=1}^{T}$, where $T$ is the sequence length. As shown in Fig. 3 (a), we first perform point-wise semantic segmentation to select the 3D points of garments in each frame, which results in a sequence of garment point clouds. Subsequently, this sequence is used to predict the PCA coefficients for each sequence, which are further used to reconstruct the target canonical garments by $T_G(\boldsymbol{\alpha}) = \bar{T}_G + \mathbf{S}\boldsymbol{\alpha}$.
2.4 Posed Garment Reconstruction
Proposal-Guided Hierarchical Feature Network. As introduced in Subsection 2.1, the skinned garments of the proposed garment model provide reasonable initial proposals for posed garments. However, due to complex human motions and garment flexibility, especially for intense movements and loose skirts, the initial garment proposals can differ greatly from the real poses. To address these problems, we propose a novel Proposal-Guided Hierarchical Feature Network, consisting of two major parts: 1) Hierarchical Garment Feature Pooling, and 2) Hierarchical Body Surface Encoder. A displacement is predicted to refine the initial garment poses. Detailed architectures of these two modules are shown in Fig. 3 (b). The proposed Hierarchical Garment Feature Pooling can effectively capture local geometric details of garments, while the Hierarchical Body Surface Encoder facilitates predicting correct garment poses by considering interactions between the garment and the human body. By estimating proper contacts between garments and skin, our body surface encoder greatly alleviates interpenetration problems.


Hierarchical Garment Feature Pooling aims at pooling accurate geometric features and rich semantic information for each vertex of the mesh proposals. Specifically, for each proposal vertex, several query balls with different radii are constructed to sample points from the segmented garment point clouds. To make use of the rich semantic information extracted when predicting the PCA coefficients, different levels of downsampled garment point clouds and the corresponding features from the previous stage are used for the query process with different radii and sample numbers. Then, for each vertex, the pooled garment features are further encoded to aggregate low-level geometric features and high-level semantic information.

Hierarchical Body Surface Encoder makes each vertex of the garment mesh aware of its neighbouring human body surface, which is important for correctly encoding the interaction between the garment and the human body, especially under large body movements. For each vertex of the proposal mesh, neighbouring human mesh vertices are sampled using query balls with different radii. Both the coordinates and normal directions of the sampled human mesh vertices are further encoded.
Iterative GCN Displacement. Concatenating the per-vertex features obtained from the above feature pooling and surface encoder, along with the coordinates of the vertex, we aggregate the necessary information for the following Iterative Graph Convolution Network (GCN) to predict the displacement $D$. Loose garments skinned by the interpolated LBS are normally not in accordance with the real situation. Therefore, we choose to iteratively predict displacements and gradually approach the ground truth. As shown in Fig. 3 (b), for the $i$-th iteration, using the previously accumulated displacement prediction $D^{(i-1)}$ and the proposal $G_{lbs}$, we can calculate the current mesh prediction $G^{(i)} = G_{lbs} + D^{(i-1)}$. $G^{(i)}$ is then used for the Proposal-Guided Hierarchical Feature Pooling of this iteration. The last GCN layer of each iteration outputs the current displacement. Inspired by kipf2016semi, the GCN layer could be formulated as $H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} + b^{(l)}\right)$, where $\tilde{A}$ is the adjacency matrix of the bidirectional mesh graph with self-connections, $\tilde{D}$ is its degree matrix, and $W^{(l)}$ and $b^{(l)}$ are the learnable weights and bias.
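A minimal dense sketch of such a graph convolution layer, in the style of kipf2016semi (illustrative; real implementations use sparse adjacency matrices and batching):

```python
import numpy as np

def gcn_layer(H, edges, W, b):
    """One graph convolution layer over a mesh graph.

    H: (V, F) vertex features, edges: list of (i, j) mesh edges,
    W: (F, F_out) learnable weights, b: (F_out,) learnable bias.
    """
    V = H.shape[0]
    A = np.eye(V)                      # self-connections
    for i, j in edges:                 # bidirectional mesh graph
        A[i, j] = A[j, i] = 1.0
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(d[:, None] * d[None, :])   # D^-1/2 (A+I) D^-1/2
    return np.maximum(A_hat @ H @ W + b, 0.0)      # ReLU activation
```

Each layer mixes a vertex's features with its mesh neighbors', so stacking a few layers lets displacement predictions stay locally consistent across the garment surface.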
Temporal Transformer. To support better temporal information extraction, a Transformer is integrated into the Iterative GCN. For the $i$-th iteration, we take the features extracted by the $(i-1)$-th iteration, perform temporal fusion with the Transformer, and concatenate the fused features to the input of the $i$-th iteration. Specifically, we denote the output feature of the second-to-last layer of the $(i-1)$-th iteration as $F$. The query, key and value are obtained by applying MLPs to $F$, denoted as $Q = \mathrm{MLP}_Q(F)$, $K = \mathrm{MLP}_K(F)$ and $V = \mathrm{MLP}_V(F)$. Then the value is updated as $V' = \mathrm{softmax}\left(QK^{\top}/\sqrt{d}\right)V$, where $d$ is the feature dimension. $V'$ is then concatenated to the pooled features of the $i$-th iteration.
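The temporal fusion reduces to standard scaled dot-product attention over the frame axis, which can be sketched per vertex and single-head as (illustrative; the MLP projections are omitted):

```python
import numpy as np

def temporal_attention(Q, K, V):
    """Scaled dot-product attention over the time axis.

    Q, K, V: (T, d) per-frame features of one garment vertex/token.
    Returns the temporally fused values, shape (T, d).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                         # (T, T) frame affinities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over frames
    return attn @ V
```

Each frame's features thus become a weighted mixture of all frames, which is what lets the network smooth garment dynamics across the sequence.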
2.5 Loss Functions
Loss functions corresponding to the two parts of the network are introduced in this section. For the canonical garment estimation, the loss function consists of five terms, which could be formulated as

$\mathcal{L}_{canonical} = \mathcal{L}_{CE} + \lambda_{\alpha}\left\|\hat{\boldsymbol{\alpha}} - \boldsymbol{\alpha}\right\|_2 + \lambda_{V}\left\|\hat{V} - V\right\|_2 + \lambda_{ip}\mathcal{L}_{ip} + \lambda_{lap}\mathcal{L}_{lap}. \quad (4)$

The first term $\mathcal{L}_{CE}$ is the cross-entropy term for semantic segmentation. The second and third terms use L2 losses to supervise our predicted PCA coefficients $\hat{\boldsymbol{\alpha}}$ and mesh vertices $\hat{V}$, respectively. The fourth term is the interpenetration loss term, which could be formulated as

$\mathcal{L}_{ip} = \sum_{i} \max\left(0,\, -\mathbf{n}_i \cdot (\hat{v}_i - p_i)\right), \quad (5)$

where $p_i$ and $\mathbf{n}_i$ are the coordinate and normal vector of the nearest human mesh vertex to the $i$-th garment mesh vertex $\hat{v}_i$. The fifth term is the Laplacian regularization term,

$\mathcal{L}_{lap} = \left\|L\hat{V}\right\|_2, \quad (6)$

where $L$ is the cotangent-weighted Laplacian matrix of the garment mesh.
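A brute-force sketch of the interpenetration penalty described above (illustrative; a practical version uses accelerated nearest-neighbor queries and batched tensors):

```python
import numpy as np

def interpenetration_loss(garment_verts, body_verts, body_normals):
    """Penalize garment vertices that fall beneath the body surface.

    For each garment vertex v, find the nearest body vertex p with normal n
    and penalize a negative signed distance n . (v - p).
    """
    d = np.linalg.norm(garment_verts[:, None, :] - body_verts[None, :, :], axis=-1)
    idx = d.argmin(axis=1)                         # nearest body vertex per garment vertex
    p, n = body_verts[idx], body_normals[idx]
    signed = np.einsum('ij,ij->i', garment_verts - p, n)
    return np.maximum(-signed, 0.0).sum()          # only inside-body vertices contribute
```

Vertices on the outward side of the body surface incur zero loss, so the gradient only pushes offending vertices back outside the skin.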
For the $i$-th iteration of the posed garment reconstruction, the loss function includes four terms, which are defined as

$\mathcal{L}_{posed}^{(i)} = \left\|\hat{V}^{(i)} - V\right\|_2 + \lambda_{ip}\mathcal{L}_{ip} + \lambda_{lap}\mathcal{L}_{lap} + \lambda_{t}\mathcal{L}_{t}. \quad (7)$

The first term is the L2 loss between the predicted and ground truth garment mesh vertices $\hat{V}^{(i)}$ and $V$. The second term penalizes interpenetration between the posed garment prediction and the posed human body. The third term performs cotangent-weighted Laplacian regularization. The last term adds temporal constraints to the deformations of the predicted garments, which could be formulated as

$\mathcal{L}_{t} = \sum_{t=2}^{T}\left\|\left(\hat{V}^{(i)}_t - \hat{V}^{(i)}_{t-1}\right) - \left(V_t - V_{t-1}\right)\right\|_2. \quad (8)$

Adding up the weighted losses of all iterations, we get the total loss of the posed garment reconstruction, $\mathcal{L}_{posed} = \sum_{i}\lambda^{(i)}\mathcal{L}_{posed}^{(i)}$.
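The temporal constraint term can be sketched as matching frame-to-frame deformations rather than absolute positions (illustrative; note that a constant positional offset incurs no temporal penalty, which is exactly why the per-vertex L2 term is still needed):

```python
import numpy as np

def temporal_constraint_loss(pred_seq, gt_seq):
    """Match predicted frame-to-frame deformation to the ground truth.

    pred_seq, gt_seq: (T, V, 3) vertex sequences.
    """
    pred_delta = pred_seq[1:] - pred_seq[:-1]   # predicted per-frame motion
    gt_delta = gt_seq[1:] - gt_seq[:-1]         # ground-truth per-frame motion
    return np.linalg.norm(pred_delta - gt_delta, axis=-1).mean()
```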
3 Experiments
3.1 Datasets and Evaluation Protocols
Datasets. We establish a point cloud sequence-based garment reconstruction dataset by adapting CLOTH3D bertiche2020cloth3d for our experiments. CLOTH3D (downloaded from http://chalearnlap.cvc.uab.es/dataset/38/description/) is a large-scale synthetic dataset with rich garment shapes and styles and abundant human pose sequences. We sample point sets from the 3D human models to produce the point cloud sequence inputs. We select three types of garments, i.e. skirts, T-shirts and trousers, for the experiments, and split the sequences of each type into training and testing sets.
In addition to the synthetic dataset, we also perform experiments on a real human scan dataset, CAPE ma2020learning (downloaded from https://cape.is.tue.mpg.de/). CAPE is a large-scale real clothed human scan dataset containing 15 subjects and 150k 3D scans. Since CAPE only releases the scanned clothed human sequences without separable garment meshes, it is not practical to train our network on CAPE from scratch. Therefore, we directly perform inference on CAPE using the network pretrained on CLOTH3D.
Evaluation Metrics. Three metrics are utilized for evaluation on CLOTH3D. We first evaluate the per-vertex L2 error of canonical garments. Secondly, we evaluate the per-vertex L2 error of posed garments, defined as $E_{L2} = \frac{1}{TN}\sum_{t}\sum_{i}\left\|\hat{v}_{t,i} - v_{t,i}\right\|_2$. The last one is the acceleration error, which is used to evaluate the smoothness of the predicted sequences and could be formulated as $E_{acc} = \frac{1}{(T-2)N}\sum_{t}\sum_{i}\left\|\hat{a}_{t,i} - a_{t,i}\right\|_2$, where $a_{t,i}$ stands for the acceleration vector of the $i$-th vertex in the $t$-th frame.
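Assuming accelerations are approximated by second-order finite differences over the frame axis (the exact discretization is an assumption, not stated in the text), the acceleration error can be sketched as:

```python
import numpy as np

def acceleration_error(pred_seq, gt_seq):
    """Mean L2 error between predicted and ground-truth per-vertex accelerations.

    pred_seq, gt_seq: (T, V, 3) vertex sequences, assuming a fixed frame interval.
    """
    acc = lambda s: s[2:] - 2 * s[1:-1] + s[:-2]    # discrete second difference
    return np.linalg.norm(acc(pred_seq) - acc(gt_seq), axis=-1).mean()
```

A sequence that is merely offset or moving at constant velocity relative to the ground truth scores zero here, so this metric isolates jitter and unsmooth motion.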
Because the ground truth garment meshes of CAPE are not provided, we evaluate the performance of different methods using the 'one-way' Chamfer Distance from the vertices $\hat{V}$ of the reconstructed meshes to the input point cloud $P$. It is formulated as $d(\hat{V}, P) = \frac{1}{|\hat{V}|}\sum_{v \in \hat{V}} \min_{p \in P} \left\|v - p\right\|_2$, where $\min_{p \in P}\left\|v - p\right\|_2$ stands for the nearest distance from point $v$ to the points in $P$.
Comparison Methods. Since this is the first work tackling garment reconstruction from point cloud sequences, there is no readily available prior work for direct comparison. Therefore, for canonical garment estimation, we adapt the PointNet++ qi2017pointnet++ structure to predict the PCA coefficients directly from the input point clouds. For posed garment reconstruction, we adapt Multi-Garment Net bhatnagar2019multi (MGN) to point cloud inputs for comparison.
3.2 Qualitative Results
Step-by-Step Visualization. As shown in Fig. 4, we visualize the whole reconstruction process step by step. The estimated canonical skirt recovers the rough shape and length of the skirt from the point cloud sequences. As expected, the interpolated LBS produces reasonable proposals with smooth surfaces and no artifacts. Two things make the LBS results look unnatural. One is the interpenetration between the garment and the human body. The other is that the lifted skirt floats in the air without touching the leg, which should be the cause of the lift. These two flaws are inevitable because the skirt is not homotopic to the human body, and they are eliminated in the final step. The final displacement prediction meets expectations in three aspects. Firstly, the hierarchical garment feature pooling module captures the geometric features of the cloth, which helps the precise shape recovery of posed garments. Secondly, the human body surface encoding module makes the network aware of the body surface positions, which prevents interpenetration between the garment and the human body and helps capture the deformations caused by body movements, e.g. knees lifting up the skirt. Thirdly, the temporal fusion module encourages smooth deformation between consecutive frames.
Reconstruction for Different Garment Types. More reconstruction results are illustrated in Fig. 5, which shows Garment4D's ability to handle both loose and tight garments driven by large body movements. For garments that are homotopic to the human body, i.e. T-shirts and trousers, Garment4D captures not only their correct shapes, but also the dynamics caused by body movement. Taking the bottom-left trousers reconstruction as an example, when the right leg is stretched out, the back side of the trousers sticks to the back of the calf, while the front part naturally forms the dynamics of being pulled backwards. Moreover, it can be observed that when the leg or arm bends, more of the ankle or forearm is exposed in order to maintain the length of the trouser legs or sleeves. For the reconstruction of skirts, which are not homotopic to the human body, both long and short skirts can be faithfully recovered. The short skirt is tighter, and we can observe natural deformation when the two legs are stretched to the sides. For the looser long skirt, clear contacts and deformations can be observed when the knees and calves lift. And when the legs are retracted to the neutral position, the skirt naturally falls back without sticking to the leg surfaces.
3.3 Quantitative Results on CLOTH3D
Canonical Garment Estimation. Tab. 3 reports the per-vertex L2 errors of canonical garment estimation for the adapted PointNet++ qi2017pointnet++ and our method. By explicitly making the network aware of the semantics, our method outperforms the vanilla PointNet++ on all three garment types.
Posed Garment Reconstruction. The per-vertex L2 errors of posed garment reconstruction are shown in Tab. 3. Because T-shirts and trousers are homotopic to the human body, the interpolated LBS step alone gives decent reconstruction results. In contrast, for skirts, the interpolated LBS results are far from the real situation and therefore have high L2 errors, which supports the above qualitative analysis. The displacement prediction effectively captures the dynamics of skirts and improves on top of the interpolated LBS. Moreover, Garment4D outperforms the adapted Multi-Garment Net bhatnagar2019multi on all three garment types. In particular, Garment4D surpasses the adapted MGN on skirts by a clear margin, which further shows the advantage of Garment4D over SMPL+D-based methods at reconstructing loose garments like skirts.
Temporal Smoothness. As reported in Tab. 3, our method also outperforms the adapted MGN in terms of reconstruction smoothness. Even though the same temporal constraint term is included in the loss functions of both methods, the adapted MGN fails to properly model the motion of garments and aggregate temporal information. The gap is particularly large for loose garments like skirts, where Garment4D clearly outperforms the adapted MGN.


Class    | MGN* | Ours
T-Shirt  | 0.430 | 0.366
Trousers | 0.922 | 0.455

3.4 Quantitative Results on CAPE
As reported in Tab. 7, our method outperforms the adapted MGN on both classes in terms of the 'one-way' Chamfer Distance, which shows the effectiveness and advantages of Garment4D on real-world data. Moreover, we directly perform inference on CAPE with the model trained on CLOTH3D without fine-tuning, which further demonstrates the generalizability of Garment4D.
3.5 Ablation Study
Hierarchical Body Surface Encoder. As shown in Fig. 7, without the hierarchical body surface encoder (H.B.S. Encoder), the L2 error increases noticeably, which is in line with expectations: without knowing the body positions, it is hard for the network to infer garment-body interactions.
Temporal Transformer. As illustrated in Fig. 7, if we remove the temporal transformer, both the L2 error and the acceleration error increase. The results indicate that the temporal transformer not only helps single-frame reconstruction by aggregating information from neighbouring frames, but also increases the smoothness of the reconstructed sequences.
Interpolated LBS. As circled in the left side of Fig. 9, if the skinning weights of garment vertices are copied from the nearest body vertex ($K = 1$), which is equivalent to SMPL+D (e.g. MGN bhatnagar2019multi), severe cliff-like artifacts appear between the legs. The quality of the LBS proposals affects the final reconstruction results, as shown in Fig. 7: with $K = 1$, the per-vertex L2 error of the final reconstruction results increases.
3.6 Further Analysis
Number of Neighbors for Interpolated LBS. As shown in Fig. 9, the performance peaks at an intermediate $K$, which explains our parameter choice. The $K$ nearest neighbor search and weight interpolation are implemented for GPU parallel computation, so the computational overhead of increasing $K$ is minor.
Performance on Incomplete Point Clouds. Real-scanned point clouds can be incomplete, which makes this an important setting to test. To control the degree of incompleteness (i.e. the percentage of missing points), we randomly crop holes out of the input point clouds and perform evaluation on the skirt test set of CLOTH3D. As shown in Fig. 11, decent reconstruction results can be obtained at up to 50% incompleteness, which shows the robustness of our method on imperfect point clouds.
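The incompleteness simulation can be sketched as cropping random spherical holes from the cloud (the hole count and radius are illustrative parameters, not the paper's settings):

```python
import numpy as np

def crop_holes(points, num_holes=3, radius=0.2, rng=None):
    """Simulate incomplete scans by cropping spherical holes from a point cloud.

    points: (N, 3). Hole centers are drawn from the cloud itself; all points
    within `radius` of a center are dropped.
    """
    rng = np.random.default_rng(rng)
    keep = np.ones(len(points), dtype=bool)
    centers = points[rng.choice(len(points), size=num_holes, replace=False)]
    for c in centers:
        keep &= np.linalg.norm(points - c, axis=1) > radius
    return points[keep]
```

Sweeping the hole count or radius then gives a controlled curve of reconstruction quality versus the fraction of missing points.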
Robustness of Reconstruction Results to Segmentation Errors. The segmentation quality of the input point clouds can affect the reconstruction results. To investigate such effects, we construct different degrees of segmentation errors and evaluate on the skirt test set of CLOTH3D. As shown in Fig. 11, the reconstruction quality decreases as the segmentation accuracy decreases. However, even at an accuracy of 50%, our method can still produce decent reconstruction results. Therefore, Garment4D is robust to segmentation errors.
4 Related Work
Template-Free Clothed Human Modeling. Most previous template-free methods reconstruct clothed humans using volumetric representations or implicit functions. varol2018bodynet; jackson20183d; zheng2019deephuman; natsume2019siclope; tang2019neural; gabeur2019moulding; smith2019facsimile reconstruct clothed humans from images using volumetric representations. Due to the large memory footprint of such representations, these methods usually have trouble recovering fine surface details. pifuSHNMKL19; saito2020pifuhd; saito2021scanimate; chen2021snarf; habermann2020deepcap propose efficient clothed human reconstruction pipelines with the help of implicit functions. Both types of methods reconstruct the human and clothes as a whole, which means it is not possible to directly exert higher-level control over the garments, e.g. reposing or retargeting.
Template-Free Garment Modeling. In order to focus more on the garments, several recent works attempt to reconstruct the human and garments separately using implicit functions. bhatnagar2020combining predicts a layered implicit function, which makes it possible to separate garments from the human body. corona2021smplicit builds a garment-specific implicit function over the canonical-posed human body, which can be controlled by latent codes that describe garment cut and style. Not relying on any specific template can be an advantage (e.g. more degrees of freedom), but in some applications, e.g. reposing, retargeting and texture mapping, it is a disadvantage because of poor interpretability. To achieve a higher level of control, bhatnagar2020combining; corona2021smplicit have to register the garment meshes, which are reconstructed from implicit functions, to the SMPL+D model.

Parametric Clothed Human Modeling. Most previous garment reconstruction works that use parametric methods tend to treat the human body and garments as a whole. A common practice, SMPL+D alldieck2019learning; alldieck2018detailed; alldieck2018video; alldieck2019tex2shape; sayo2019human; zhu2019detailed; xiang2020monoclothcap, is to predict displacements from the SMPL SMPL:2015 vertices. However, SMPL+D has difficulty modeling loose garments or garments that do not share the same topology as the human body, which limits its expressive ability.
Parametric Garment Modeling. In order to reconstruct garments separately using parametric models, PCA decomposition is used by bhatnagar2019multi; jiang2020bcnet to model single-layered garment meshes, independently of any parametric human body model. To further increase the expressive ability of an individual parametric garment model, su2020deepcloth maps garments to two types of UV maps according to whether the garment is homotopic to the human body. As for the deformation of the garment models, most existing methods rely on the skinning process of the human body. bhatnagar2019multi samples the skinning weights of garment vertices from the closest human body vertices, while jiang2020bcnet predicts the skinning weights of garment vertices using a neural network. Relying only on the skinning process leads to overly smoothed results, which is addressed by patel2020tailornet through additional displacement prediction before skinning.

5 Conclusion
In this work, we present the first attempt at garment reconstruction from point cloud sequences. A garment model with the novel interpolated LBS is introduced to handle loose garments. A garment mesh registration method is proposed to prepare sequential meshes for the garment reconstruction network. Taking point cloud sequences as input, the corresponding canonical garment is estimated by predicting PCA coefficients. Then a Proposal-Guided Hierarchical Feature Network is introduced to perform semantic, geometric and human-body-aware feature aggregation. An efficient Iterative GCN along with a Temporal Transformer is utilized to further encode the features and predict displacements iteratively. Qualitative and quantitative results from extensive experiments demonstrate the effectiveness of the proposed framework.
Acknowledgments. This study is supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088), and the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).