3D Pose Transfer with Correspondence Learning and Mesh Refinement

09/30/2021
by   Chaoyue Song, et al.
Nanyang Technological University

3D pose transfer is one of the most challenging 3D generation tasks. It aims to transfer the pose of a source mesh to a target mesh while preserving the identity (e.g., body shape) of the target mesh. Some previous works require key point annotations to build reliable correspondence between the source and target meshes, while other methods do not consider any shape correspondence between sources and targets, which leads to limited generation quality. In this work, we propose a correspondence-refinement network for 3D pose transfer on both human and animal meshes. The correspondence between source and target meshes is first established by solving an optimal transport problem. We then warp the source mesh according to the dense correspondence to obtain a coarse warped mesh. The warped mesh is further refined with our proposed Elastic Instance Normalization, a conditional normalization layer that helps to generate high-quality meshes. Extensive experimental results show that the proposed architecture effectively transfers poses from source to target meshes and produces results with better visual quality than state-of-the-art methods.


1 Introduction

3D pose transfer has been drawing a lot of attention from the vision and graphics communities. It has potential applications in 3D animated movies and games for generating new poses for existing shapes and animation sequences. 3D pose transfer is a learning-driven generation task similar to style transfer on 2D images. As shown in Figure 1, pose transfer takes two inputs: an identity mesh that provides the identity information (e.g., body shape), and a pose mesh that provides the pose information. The goal is to transfer the pose of the source pose mesh to the target identity mesh while keeping the identity of the target identity mesh.

A fundamental problem for previous methods is building reliable correspondence between source and target meshes, which can be very challenging when the two meshes differ significantly. Most previous methods address it with user effort or other additional inputs, such as key point annotations [3, 35, 42]. Unfortunately, obtaining such additional inputs is time-consuming, which limits their usage in practice. In [38], pose transfer is implemented without correspondence learning. Their method is convenient, but the performance degrades since the correspondence between meshes is not considered. In this work, we propose a COrrespondence-REfinement Network (3D-CoreNet) to solve the pose transfer problem for both human and animal meshes. Like [38], our method needs no key point annotations or other additional inputs. We first learn the shape correspondence between identity and pose meshes, then warp the pose mesh into a coarse warped output according to the correspondence. Finally, the warped mesh is refined to improve its visual quality. Our method does not require the two meshes to have the same number or order of vertices.

For the correspondence learning module, we cast shape correspondence learning as an optimal transport problem. Our network takes the vertex coordinates of the identity and pose meshes as inputs. We extract deep features at each vertex using point cloud convolutions and compute a matching cost between the vertex sets with the extracted features. Our goal is to minimize the matching cost to obtain an optimal matching matrix, with which we warp the pose mesh and obtain a coarse warped mesh. We then refine the warped output with a set of elastic instance normalization residual blocks, whose modulation parameters are learned with our proposed Elastic Instance Normalization (ElaIN). To generate smoother meshes with more details, we introduce a channel-wise weight in ElaIN that adaptively blends the statistics of the original features with the parameters learned from external data, which helps to keep the consistency and continuity of the original features.

Our contributions can be summarized as follows:
  • We solve the pose transfer problem with our proposed correspondence-refinement network. To the best of our knowledge, our method is the first to jointly learn the correspondence between different meshes and refine the generated meshes in the 3D pose transfer task.
  • We learn the shape correspondence by solving an optimal transport problem without any key point annotations, and generate high-quality final meshes with our proposed elastic instance normalization in the refinement module.
  • Through extensive experiments, we demonstrate that our method outperforms state-of-the-art methods quantitatively and qualitatively on both human and animal meshes.

Figure 1: Pose transfer results generated by our 3D-CoreNet. In the first two rows, the human identity and pose meshes are from SMPL [24]; in the last two rows, the animal identity and pose meshes are from SMAL [47].

2 Related work

2.1 Deep learning methods on 3D data

The representations of 3D data are various: point clouds, voxels, and meshes. 3DShapeNets [41] and VoxNet [26] learn on volumetric grids, but these methods do not scale to complex data due to data sparsity and the computational cost of 3D convolutions. PointNet [30] applies a shared MLP to every point followed by a global max-pooling. Following PointNet, hierarchical architectures have been proposed to aggregate local neighborhood information with MLPs [21, 31]. [14, 36] proposed mesh variational autoencoders to learn mesh features, but their fully-connected networks consume a large amount of computing resources. Many works use graph convolutions with mesh down- and up-sampling layers [16], such as CoMA [32] and CAPE [25] based on ChebyNet [11], and [46] based on SpiralNet [22] and SpiralNet++ [17]. They all need a template to implement their hierarchical structure, which is not applicable to open-world problems. In this work, we use meshes as the 3D shape representation and shared-weight convolution layers in our network.

2.2 3D pose transfer

Deformation transfer in graphics aims to apply the deformation exhibited by a source mesh onto a different target mesh [35]. 3D pose transfer aims to generate a new mesh from the knowledge of a pair of source and target meshes. The methods in [3, 35, 42, 43] all require corresponding landmarks to be labeled first to handle the differences between meshes. Baran et al. [2] proposed a method that infers a semantic correspondence between different poses of two characters with the guidance of example mesh pairs. Chu et al. [8] generate results from a few examples, which makes fully automatic pose transfer difficult. To address this, Gao et al. [15] proposed to use cycle consistency to achieve pose transfer. However, their method cannot deal with new identities due to the limitations of the visual similarity metric. In [38], pose transfer is solved with techniques adapted from recent image style transfer. Their work needs no other guidance, but the performance is also restrained since no correspondence is learned. To solve these problems, our network learns the correspondence and refines the generated meshes jointly.

2.3 Correspondence learning

CoCosNet [45] introduced a correspondence network based on the correlation matrix between images, without any constraints. To learn a better matching, we propose to use optimal transport to learn the correspondence between meshes. Recently, optimal transport has received great attention in various computer vision tasks. Courty et al. [9] align the representations of the source and target domains by learning a transportation plan. Su et al. [34] compute the optimal transport map to deal with surface registration and the shape space problem. Other applications include generative models [1, 6, 12, 40], scene flow [29], semantic correspondence [23], and more.

2.4 Conditional normalization layers

After normalizing the activation values, conditional normalization denormalizes them with modulation parameters computed from external data. Adaptive Instance Normalization (AdaIN) [19] aligns the mean and variance of content and style features, achieving arbitrary style transfer. Soft-AdaIN [7] introduces a channel-wise weight to blend the feature statistics of the content and style images, preserving more details in the results. Spatially-Adaptive Normalization (SPADE) [28] better preserves semantic information by not washing it away when applied to segmentation masks. SPAdaIN [38] replaces the batch normalization [20] in SPADE with instance normalization [37] for 3D pose transfer. However, it breaks the consistency and continuity of the feature map during denormalization, which harms mesh smoothness and detail preservation. To address this problem, our ElaIN introduces an adaptive weight into the denormalization.

3 Method

Given a source pose mesh and a target identity mesh, our goal is to transfer the pose of the source mesh to the target mesh while keeping the identity of the target mesh. In this section, we introduce our end-to-end Correspondence-Refinement Network (3D-CoreNet) for 3D pose transfer.

Figure 2: The architecture of 3D-CoreNet. With the extracted features, the shape correspondence between identity and pose meshes is first established by solving an optimal transport problem. Then, we warp the pose mesh according to the optimal matching matrix and obtain a coarse warped mesh. The warped mesh is then refined with our proposed ElaIN in the refinement module.

A 3D mesh can be represented by its identity, pose and vertex order, where the identity denotes the body shape of the mesh, the pose represents its articulation, and the vertex order is the ordering of its vertices. Given two meshes $M_{id}$ and $M_{pose}$, we aim to transfer the pose of $M_{pose}$ to $M_{id}$ and generate the output mesh $\hat{M}$. Our 3D-CoreNet takes $V_{id} \in \mathbb{R}^{N_{id} \times 3}$ and $V_{pose} \in \mathbb{R}^{N_{pose} \times 3}$ as inputs, which are the coordinates of the mesh vertices; $N_{id}$ and $N_{pose}$ denote the numbers of vertices of the identity mesh and the pose mesh respectively.

As shown in Figure 2, the vertices of the meshes are first fed into the network to extract multi-scale deep features. We calculate the matching matrix from the vertex feature maps by solving an optimal transport problem, then warp the pose mesh according to the matrix to obtain the warped mesh. Finally, the warped mesh is refined into the final output mesh with our proposed elastic instance normalization (ElaIN). The output mesh combines the pose of the source mesh with the identity of the target, and inherits the vertex order of the identity mesh.

3.1 Correspondence learning

Given an identity mesh and a pose mesh, our correspondence learning network calculates an optimal matching matrix, in which each element represents the similarity between a pair of vertices from the two meshes. The first step in our shape correspondence learning is to compute a correlation matrix from the extracted features; it is based on cosine similarity and measures the matching similarity between any two positions on different meshes. However, the matching scores in the correlation matrix are calculated without any additional constraints. To learn a better matching, we take a global perspective and model the problem as an optimal transport problem.

Correlation matrix

We first introduce our feature extractor, which extracts features from the unordered input vertices. Similar to [38], it consists of 3 stacked Conv1d and Instance Normalization layers, all with LeakyReLU activations. Given the extracted vertex feature maps $F_{id} \in \mathbb{R}^{D \times N_{id}}$ and $F_{pose} \in \mathbb{R}^{D \times N_{pose}}$ of the identity and pose meshes ($D$ is the channel-wise dimension), a popular way to compute the correlation matrix is the cosine similarity [44, 45]. Concretely, we compute the correlation matrix $C \in \mathbb{R}^{N_{id} \times N_{pose}}$ as:

$$C(i,j) = \frac{F_{id}(i)^{\top} F_{pose}(j)}{\left\| F_{id}(i) \right\| \left\| F_{pose}(j) \right\|}, \qquad (1)$$

where $C(i,j)$ denotes the individual matching score between $F_{id}(i)$ and $F_{pose}(j)$, the channel-wise features of $F_{id}$ at position $i$ and of $F_{pose}$ at position $j$.
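As a concrete illustration, here is a minimal PyTorch sketch of this cosine-similarity correlation; the tensor names and shapes are our assumptions rather than the authors' released code:

```python
import torch

def correlation_matrix(f_id: torch.Tensor, f_pose: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity correlation between two vertex feature maps (Eq. 1).

    f_id:   (B, D, N_id)   vertex features of the identity mesh
    f_pose: (B, D, N_pose) vertex features of the pose mesh
    Returns (B, N_id, N_pose) matching scores in [-1, 1].
    """
    f_id = f_id / (f_id.norm(dim=1, keepdim=True) + 1e-8)       # channel-wise L2 normalization
    f_pose = f_pose / (f_pose.norm(dim=1, keepdim=True) + 1e-8)
    return torch.einsum('bdi,bdj->bij', f_id, f_pose)           # dot products of unit vectors
```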

Optimal transport problem

To learn a better matching under additional constraints, we model shape correspondence learning as an optimal transport problem. We first define a matching matrix $T \in \mathbb{R}^{N_{id} \times N_{pose}}$ between the identity and pose meshes. The total correlation is then $\langle T, C \rangle = \sum_{i,j} T(i,j)\, C(i,j)$, and the aim is to maximize this total correlation score to obtain the optimal matching matrix $\tilde{T}$.

We treat correspondence learning between the identity and pose meshes as the transport of mass. A mass equal to $1/N_{id}$ is assigned to each vertex in the identity mesh, and each vertex in the pose mesh receives mass from the identity mesh through the built correspondence between vertices. If we define $C_{cost} = 1 - C$ as the cost matrix, our goal can be formulated as a standard optimal transport problem that minimizes the total matching cost,

$$\tilde{T} = \underset{T}{\arg\min} \sum_{i,j} T(i,j)\, C_{cost}(i,j), \quad \text{s.t.} \quad T\,\mathbf{1}_{N_{pose}} = \frac{1}{N_{id}}\,\mathbf{1}_{N_{id}}, \quad T^{\top}\mathbf{1}_{N_{id}} = \frac{1}{N_{pose}}\,\mathbf{1}_{N_{pose}}, \qquad (2)$$

where $\mathbf{1}_{N_{id}}$ and $\mathbf{1}_{N_{pose}}$ are vectors whose elements are all 1. The first constraint in Eq. 2 means that the mass of each vertex in $M_{id}$ is entirely transported to some of the vertices in $M_{pose}$, and by the second constraint each vertex in $M_{pose}$ receives a mass of $1/N_{pose}$ from some of the vertices in $M_{id}$. This problem can be solved by the Sinkhorn-Knopp algorithm [33]; the details of the solving process are given in the supplementary material.

With the matching matrix, we can warp the pose mesh and obtain the vertex coordinates of the warped mesh,

$$V_{warp} = N_{id}\,\tilde{T}\,V_{pose}. \qquad (3)$$

The warped mesh inherits the number and order of vertices from the identity mesh and can be reconstructed with the face information of the identity mesh, as shown in Figure 2.
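A sketch of this warping step, under our assumption that each row of $\tilde{T}$ sums to $1/N_{id}$ (per the constraints in Eq. 2), so that scaling by $N_{id}$ yields a convex combination of pose-mesh vertices:

```python
import torch

def warp_pose_mesh(t_mat: torch.Tensor, v_pose: torch.Tensor) -> torch.Tensor:
    """Warp the pose mesh with the optimal matching matrix (Eq. 3).

    t_mat:  (B, N_id, N_pose) matching matrix whose rows sum to 1/N_id
    v_pose: (B, N_pose, 3)    pose-mesh vertex coordinates
    Returns (B, N_id, 3) warped vertices, ordered like the identity mesh.
    """
    n_id = t_mat.shape[1]
    return n_id * torch.bmm(t_mat, v_pose)  # per-vertex weighted average of pose vertices
```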

3.2 Mesh refinement

In this section, we introduce our mesh refinement module which refines the warped mesh to the desired output progressively.

Elastic instance normalization

Previous conditional normalization layers [19, 28, 38] used in different tasks always calculate their denormalization parameters from the external data alone. We argue that this may break the consistency and continuity of the original features. Inspired by [7], we propose Elastic Instance Normalization (ElaIN), which adaptively and elastically blends the statistics of the original features with the parameters learned from external data.

As shown in Figure 2, the warped mesh is flatter than desired and somewhat out of shape, but it successfully inherits the pose of the source mesh. Therefore, we denormalize the warped mesh with the feature maps of the identity mesh to obtain a better final output. Let $F_{warp} \in \mathbb{R}^{B \times D \times N}$ be the activation value before the normalization layer, where $B$ is the batch size, $D$ is the dimension of the feature channels and $N$ is the number of vertices. At first, we normalize the feature maps of the warped mesh with instance normalization; the mean and standard deviation are calculated across the spatial dimension ($N$) for each sample $b$ and each channel $d$,

$$\mu_{b,d} = \frac{1}{N} \sum_{n=1}^{N} F_{warp}^{b,d,n}, \qquad (4)$$

$$\sigma_{b,d} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( F_{warp}^{b,d,n} - \mu_{b,d} \right)^2 + \epsilon}. \qquad (5)$$

Then the feature maps of the identity mesh are fed into a convolution layer to get $F'_{id}$, which shares the same size as $F_{warp}$. We adopt global average pooling to pool $F_{warp}$ and $F'_{id}$ into $B \times D$ tensors. The tensors are then concatenated in the channel dimension to get a $B \times 2D$ tensor, and a fully-connected layer is employed to compute an adaptive weight $w$ from it. With $w$, we can define the modulation parameters of our normalization layer,

$$\gamma = w \cdot \sigma + (1 - w) \cdot \gamma_{id}, \qquad \beta = w \cdot \mu + (1 - w) \cdot \beta_{id}, \qquad (6)$$

where $\gamma_{id}$ and $\beta_{id}$ are learned from the identity feature $F'_{id}$ with two convolution layers. Finally, we can scale the normalized $F_{warp}$ with $\gamma$ and shift it with $\beta$,

$$\mathrm{ElaIN}(F_{warp}) = \gamma \cdot \frac{F_{warp} - \mu}{\sigma} + \beta. \qquad (7)$$
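A minimal PyTorch sketch of how ElaIN could be implemented, assuming per-vertex 1D features; the layer widths, the sigmoid on the blend weight, and the variable names are our assumptions, not the authors' exact design:

```python
import torch
import torch.nn as nn

class ElaIN(nn.Module):
    """Elastic Instance Normalization (sketch; not the authors' exact layer sizes)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_id = nn.Conv1d(channels, channels, 1)     # F_id -> F'_id
        self.conv_gamma = nn.Conv1d(channels, channels, 1)  # F'_id -> gamma_id
        self.conv_beta = nn.Conv1d(channels, channels, 1)   # F'_id -> beta_id
        self.fc_w = nn.Linear(2 * channels, channels)       # adaptive blend weight w

    def forward(self, x: torch.Tensor, f_id: torch.Tensor) -> torch.Tensor:
        # x: (B, D, N) warped-mesh features; f_id: (B, D, N) identity features
        mu = x.mean(dim=2, keepdim=True)                     # Eq. (4)
        sigma = (x.var(dim=2, keepdim=True) + 1e-5).sqrt()   # Eq. (5)

        f = self.conv_id(f_id)                               # F'_id
        gamma_id, beta_id = self.conv_gamma(f), self.conv_beta(f)

        # channel-wise adaptive weight from globally pooled statistics of both streams
        pooled = torch.cat([x.mean(dim=2), f.mean(dim=2)], dim=1)  # (B, 2D)
        w = torch.sigmoid(self.fc_w(pooled)).unsqueeze(-1)         # (B, D, 1), kept in [0, 1]

        gamma = w * sigma + (1 - w) * gamma_id               # Eq. (6)
        beta = w * mu + (1 - w) * beta_id
        return gamma * (x - mu) / sigma + beta               # Eq. (7)
```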

Refinement network

Our refinement network is designed to refine the warped mesh progressively. Following [28, 38, 45], we design the ElaIN residual block by embedding our ElaIN in the form of ResNet blocks [18]. As shown in Figure 2, our architecture contains $l$ ElaIN residual blocks, each of which consists of our proposed ElaIN followed by a simple convolution layer and LeakyReLU. With the ElaIN residual blocks, the warped mesh is refined into our desired high-quality output. Please refer to the supplementary material for the detailed architecture of ElaIN. A sketch of one such block is shown below.
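This sketch reuses the ElaIN module sketched above; the number of ElaIN-conv-activation sub-layers per block is our assumption:

```python
import torch.nn as nn

class ElaINResBlock(nn.Module):
    """Residual block built around ElaIN (sketched earlier), in the spirit of [18, 28]."""
    def __init__(self, channels: int):
        super().__init__()
        self.elain1, self.elain2 = ElaIN(channels), ElaIN(channels)
        self.conv1 = nn.Conv1d(channels, channels, 1)
        self.conv2 = nn.Conv1d(channels, channels, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, f_id):
        h = self.act(self.conv1(self.elain1(x, f_id)))  # ElaIN -> conv -> LeakyReLU
        h = self.act(self.conv2(self.elain2(h, f_id)))
        return x + h                                    # residual connection
```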

3.3 Loss function

We jointly train the correspondence learning module and the mesh refinement module by minimizing the following loss functions.

Reconstruction loss

Following [38], we train our network with the supervision of the ground truth mesh $M_{gt}$. We first process the ground truth mesh to have the same vertex order as the identity mesh. Then we define the reconstruction loss as the point-wise distance between the vertices of $\hat{M}$ and $M_{gt}$,

$$\mathcal{L}_{rec} = \frac{1}{N_{id}} \sum_{i=1}^{N_{id}} \left\| \hat{v}_i - v_i^{gt} \right\|_2^2, \qquad (8)$$

where $\hat{v}_i$ and $v_i^{gt}$ are the vertices of $\hat{M}$ and $M_{gt}$ respectively. Note that both share the same size and order with the vertices of the identity mesh. With the reconstruction loss, the mesh predicted by our model moves closer to the ground truth.

Edge loss

In this work, we also introduce an edge loss, which is often used in 3D mesh generation tasks [27, 38, 39]. Since the reconstruction loss does not consider the connectivity of mesh vertices, the generated mesh may suffer from flying vertices and overlong edges. The edge loss helps penalize flying vertices and produces smoother surfaces. For every vertex $\hat{v}_i$, let $\mathcal{N}(\hat{v}_i)$ be the neighbors of $\hat{v}_i$; the edge loss is defined as

$$\mathcal{L}_{edge} = \sum_{i} \sum_{\hat{v}_j \in \mathcal{N}(\hat{v}_i)} \left\| \hat{v}_i - \hat{v}_j \right\|_2^2. \qquad (9)$$

We then train our network with the combined loss function $\mathcal{L}$,

$$\mathcal{L} = \lambda_{rec}\,\mathcal{L}_{rec} + \mathcal{L}_{edge}, \qquad (10)$$

where $\lambda_{rec}$ denotes the weight of the reconstruction loss.
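A minimal sketch of Eqs. (8)-(10) in PyTorch; the edge-list construction and the exact reductions (means over vertices and edges) are our assumptions:

```python
import torch

def reconstruction_loss(v_pred: torch.Tensor, v_gt: torch.Tensor) -> torch.Tensor:
    """Eq. (8): point-wise distance; assumes identical vertex order. Shapes: (B, N, 3)."""
    return ((v_pred - v_gt) ** 2).sum(dim=-1).mean()

def edge_loss(v_pred: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    """Eq. (9): penalize overlong edges. edges: (E, 2) vertex-index pairs taken
    from the identity mesh's face connectivity."""
    diff = v_pred[:, edges[:, 0]] - v_pred[:, edges[:, 1]]  # (B, E, 3)
    return (diff ** 2).sum(dim=-1).mean()

def total_loss(v_pred, v_gt, edges, lambda_rec: float = 2000.0) -> torch.Tensor:
    """Eq. (10); lambda_rec = 2000 as stated in the implementation details."""
    return lambda_rec * reconstruction_loss(v_pred, v_gt) + edge_loss(v_pred, edges)
```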

4 Experiment

Datasets.

For the human mesh dataset, we use the same dataset as Wang et al. [38], generated by SMPL [24]. This dataset consists of 30 identities with 800 poses, and each mesh has 6890 vertices. For the training data, we randomly choose 4000 (identity, pose) mesh pairs from 16 identities with 400 poses and shuffle them every epoch. The ground truth meshes are determined by the identity and pose parameters of the pairs. Before being fed into our network, every mesh is shuffled randomly to approximate the open-world setting. Note that the ground truth mesh shares the same vertex order with the identity mesh for the convenience of supervised training and evaluation. For all input meshes, we shift them to the center according to their bounding box. At test time, we evaluate our model on 14 new identities with 200 unseen poses, from which we randomly choose 400 pairs; they are pre-processed in the same manner as the training data. To further test the generalization of our model, we also try our model on FAUST [5] and MG-dataset [4] in the experiments.

For the animal mesh dataset, we generate animal training and test data using the SMAL model [47]. This dataset has 41 identities with 600 poses. The 41 identities comprise 21 felidae animals (1 cat, 5 cheetahs, 8 lions, 7 tigers), 5 canidae animals (2 dogs, 1 fox, 1 wolf, 1 hyena), 8 equidae animals (1 deer, 1 horse, 6 zebras), 4 bovidae animals (4 cows) and 3 hippopotamidae animals (3 hippos). Every mesh has 3889 vertices. For the training data, we randomly choose 11600 pairs from 29 identities (16 felidae, 3 canidae, 6 equidae, 2 bovidae, 2 hippopotamidae) with 400 poses. For the test data, we randomly choose 400 pairs from the other 12 identities (5 felidae, 2 canidae, 2 equidae, 2 bovidae, 1 hippopotamidae) with 200 poses. All inputs are pre-processed in the same manner as the human data; the pre-processing amounts to the two operations sketched below.
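A sketch of the pre-processing, under the assumption that a mesh is stored as an (N, 3) vertex tensor:

```python
import torch

def center_by_bounding_box(v: torch.Tensor) -> torch.Tensor:
    """Shift a mesh so that its axis-aligned bounding box is centered at the origin."""
    center = (v.max(dim=0).values + v.min(dim=0).values) / 2
    return v - center

def shuffle_vertices(v: torch.Tensor) -> torch.Tensor:
    """Randomly permute the vertex order to mimic the open-world setting."""
    return v[torch.randperm(v.shape[0])]
```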

Evaluation metrics.

Following [38], we use Point-wise Mesh Euclidean Distance (PMD) as one of our evaluation metrics; PMD is the point-wise distance between the vertices of the output mesh and the ground truth mesh. We also evaluate our model with Chamfer Distance (CD) and Earth Mover's Distance (EMD) proposed in [13]. For PMD, CD and EMD, smaller is better.
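For reference, a sketch of PMD and a brute-force Chamfer Distance; real evaluations typically use an optimized implementation, and the exact reductions here are our assumptions:

```python
import torch

def pmd(v_pred: torch.Tensor, v_gt: torch.Tensor) -> torch.Tensor:
    """Point-wise Mesh Euclidean Distance; assumes identical vertex order. Shapes: (N, 3)."""
    return (v_pred - v_gt).norm(dim=-1).mean()

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between vertex sets a: (N, 3) and b: (M, 3).
    O(N*M) memory; fine for meshes of a few thousand vertices."""
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```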

Implementation details.

The weight $\lambda_{rec}$ in the loss function is set to 2000. We implement our model in PyTorch and use the Adam optimizer; please refer to the supplementary material for the details of the network. Our model is trained for 200 epochs on one RTX 3090 GPU. The learning rate is fixed for the first 100 epochs and decays each epoch afterwards. The batch size is 8.

4.1 Comparison with the state-of-the-arts

Method             Annotation   Dataset     PMD     CD      EMD
DT [35]            Key points   SMPL [24]   0.15    0.35    2.21
Wang et al. [38]   -            SMPL [24]   0.66    1.42    4.22
Ours               -            SMPL [24]   0.08    0.22    1.89
DT [35]            Key points   SMAL [47]   13.37   35.77   15.90
Wang et al. [38]   -            SMAL [47]   6.75    14.52   11.65
Ours               -            SMAL [47]   2.26    4.05    7.28
Table 1: Quantitative comparison with other methods. We compare our method with DT (which needs key point annotations) and Wang et al. using PMD, CD and EMD as evaluation metrics on both human and animal data; smaller is better for all of them.
Figure 3: Qualitative comparison of different methods on human data. The identity and pose meshes are from SMPL [24]. Our method and DT (which needs key point annotations) generate better results than Wang et al. when transferring poses between human meshes; the results of Wang et al. are often not smooth on the arms or legs. Since DT requires the user to label key point annotations, our method is more efficient and practical than DT.
Figure 4: Qualitative comparison of different methods on animal data. The identity and pose meshes are from SMAL [47]. Our method produces more successful results when transferring poses between different animal meshes. Although DT has key point annotations, it still fails to transfer the pose when the identities of the mesh pair are very different. The method of Wang et al. produces very flat legs and wrongly oriented faces.

In this section, we compare our method with Deformation Transfer (DT) [35] and Wang et al. [38]. DT relies on control points labeled by the user and a reference mesh as additional inputs; we therefore test DT with the reference mesh and 11 and 19 labeled points on animal data and human data respectively. The method of [38] does not consider any correspondence; we train their model using the implementation provided by the authors. The qualitative results on SMPL [24] and SMAL [47] are shown in Figure 3 and Figure 4. On human data, DT and our method produce better results that are close to the ground truth, but DT is very time-consuming when dealing with a new identity and pose mesh pair. The results generated by Wang et al. often fail to capture the right pose and are not very smooth on the arms or legs, since no correspondence is considered. On animal data, DT fails to transfer the pose even when we add more labeled points; it does not work when the identities of the mesh pair are very different. Wang et al. may produce very flat legs and wrongly oriented faces. In comparison, our method still produces satisfactory results efficiently.

We adopt Point-wise Mesh Euclidean Distance (PMD), Chamfer Distance (CD) and Earth Mover's Distance (EMD) to evaluate the generated results of the different methods. All metrics are calculated between the ground truth and the predicted results. The quantitative results are shown in Table 1: our 3D-CoreNet outperforms the other methods in all metrics on both datasets. On animal data, which contains more diverse identities, our method shows an even larger advantage.

4.2 Ablation study

Figure 5: Ablation study results. We test 5 variants on SMPL [24]. The first two are tested without refinement: Corr (a) uses the correlation matrix and Corr (b) uses the optimal matching matrix to learn the correspondence. The model does not perform well without refinement, and using the optimal matching matrix performs better than using the correlation matrix. In the third column, the surface of the mesh has clear artifacts and is not smooth when we replace ElaIN with SPAdaIN.
Dataset     Metric   Corr (a)   Corr (b)   w/o ElaIN   w/o L_edge   Full model
SMPL [24]   PMD      0.46       0.44       0.15        0.14         0.08
            CD       1.39       1.28       0.37        0.34         0.22
            EMD      3.49       3.42       2.57        2.28         1.89
Table 2: Ablation study. We use all 3 measurements; smaller is better for all of them. w/o means without this component. Corr (a) uses the correlation matrix and Corr (b) uses the optimal matching matrix to learn the correspondence, both without the refinement module. In w/o ElaIN, we replace our ElaIN with the SPAdaIN of [38] for comparison.

In this section, we study the effectiveness of several components of our 3D-CoreNet on human data. First, we test our model without the refinement module, using only our correspondence module with either the correlation matrix or the optimal matching matrix. Here, the warped mesh is viewed as the final output and the reconstruction loss is calculated between the warped mesh and the ground truth. We then replace our ElaIN with the SPAdaIN of [38] to verify the effectiveness of ElaIN. Finally, we test the importance of the edge loss $\mathcal{L}_{edge}$.

The results are shown in Table 2 and Figure 5. We evaluate the variants with PMD, CD and EMD. Without the refinement module, the model performs poorly both qualitatively and quantitatively, and using the optimal matching matrix performs better than using the correlation matrix. When we replace our ElaIN with SPAdaIN, the surface of the mesh has clear artifacts and is not smooth, and the metrics are also worse than the full model; this shows that ElaIN is very helpful for generating high-quality results. We also evaluate the importance of $\mathcal{L}_{edge}$: the connections between vertices are better and smoother with the edge loss.

Figure 6: Pose transfer results on human data from different datasets. We test our model on FAUST [5] and MG-dataset [4], which contain human meshes different from SMPL [24]. Our method still performs well. Please refer to the supplementary material for more generated results.

4.3 Generalization capability

To evaluate the generalization capability of our method, we test it on FAUST [5] and MG-dataset [4] in this section. Human meshes in FAUST have the same number of vertices as SMPL [24] but more unseen identities. In MG-dataset, the human meshes are all dressed, have 27554 vertices each, and exhibit more realistic details. As shown in Figure 6, our method also performs well on FAUST and MG-dataset. In the first group, we transfer a pose from FAUST to an identity from SMPL; in the second group, we transfer a pose from SMPL to an identity from MG-dataset. Both transfer the pose and keep the identity successfully.

5 Conclusion

In this paper, we propose a correspondence-refinement network (3D-CoreNet) to transfer the pose of a source mesh to a target mesh while retaining the identity of the target mesh. 3D-CoreNet learns the correspondence between different meshes and refines the generated meshes jointly, producing high-quality meshes with the proposed ElaIN in the refinement module. Compared to other methods, our model learns the correspondence without key point labeling and achieves better performance on both human and animal meshes. In the future, we will extend our approach to unsupervised learning with the help of cycle consistency.

Acknowledgements

This research was conducted in collaboration with SenseTime. This work is supported by A*STAR through the Industry Alignment Fund - Industry Collaboration Projects Grant. This work is also supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2018-003), and the MOE Tier-1 research grants: RG28/18 (S) and RG95/20.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International conference on machine learning, pp. 214–223. Cited by: §2.3.
  • [2] I. Baran, D. Vlasic, E. Grinspun, and J. Popović (2009) Semantic deformation transfer. In ACM SIGGRAPH 2009 papers, pp. 1–6. Cited by: §2.2.
  • [3] M. Ben-Chen, O. Weber, and C. Gotsman (2009) Spatial deformation transfer. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 67–74. Cited by: §1, §2.2.
  • [4] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll (2019) Multi-garment net: learning to dress 3d people from images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5420–5430. Cited by: Figure 10, §B.2, Table 5, Figure 6, §4, §4.3.
  • [5] F. Bogo, J. Romero, M. Loper, and M. J. Black (2014) FAUST: dataset and evaluation for 3d mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3794–3801. Cited by: Figure 10, §B.2, Table 5, Figure 6, §4, §4.3.
  • [6] C. Bunne, D. Alvarez-Melis, A. Krause, and S. Jegelka (2019) Learning generative models across incomparable spaces. In International Conference on Machine Learning, pp. 851–861. Cited by: §2.3.
  • [7] Y. Chen, M. Chen, C. Song, and B. Ni (2020) CartoonRenderer: an instance-based multi-style cartoon image translator. In International Conference on Multimedia Modeling, pp. 176–187. Cited by: §2.4, §3.2.
  • [8] H. Chu and C. Lin (2010) Example-based deformation transfer for 3d polygon models.. J. Inf. Sci. Eng. 26 (2), pp. 379–391. Cited by: §2.2.
  • [9] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy (2016) Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence 39 (9), pp. 1853–1865. Cited by: §2.3.
  • [10] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. Advances in neural information processing systems 26, pp. 2292–2300. Cited by: §A.2.
  • [11] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3844–3852. Cited by: §2.1.
  • [12] I. Deshpande, Y. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing (2019) Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10648–10656. Cited by: §2.3.
  • [13] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §4.
  • [14] Y. Feng, Y. Feng, H. You, X. Zhao, and Y. Gao (2019) Meshnet: mesh neural network for 3d shape representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8279–8286. Cited by: §2.1.
  • [15] L. Gao, J. Yang, Y. Qiao, Y. Lai, P. L. Rosin, W. Xu, and S. Xia (2018) Automatic unpaired shape deformation transfer. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–15. Cited by: §2.2.
  • [16] M. Garland and P. S. Heckbert (1997) Surface simplification using quadric error metrics. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 209–216. Cited by: §2.1.
  • [17] S. Gong, L. Chen, M. Bronstein, and S. Zafeiriou (2019) Spiralnet++: a fast and highly efficient mesh convolution operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §2.1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §3.2.
  • [19] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2.4, §3.2.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §2.4.
  • [21] J. Li, B. M. Chen, and G. H. Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: §2.1.
  • [22] I. Lim, A. Dielen, M. Campen, and L. Kobbelt (2018) A simple approach to intrinsic correspondence learning on unstructured 3d meshes. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §2.1.
  • [23] Y. Liu, L. Zhu, M. Yamada, and Y. Yang (2020) Semantic correspondence as an optimal transport problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4463–4472. Cited by: §2.3.
  • [24] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM transactions on graphics (TOG) 34 (6), pp. 1–16. Cited by: Table 3, Figure 8, §B.1, Table 5, Figure 1, Figure 3, Figure 5, Figure 6, §4, §4.1, §4.3, Table 1, Table 2.
  • [25] Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang, and M. J. Black (2020) Learning to dress 3d people in generative clothing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6469–6478. Cited by: §2.1.
  • [26] D. Maturana and S. Scherer (2015) Voxnet: a 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §2.1.
  • [27] J. Pan, X. Han, W. Chen, J. Tang, and K. Jia (2019) Deep mesh reconstruction from single rgb images via topology modification networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9964–9973. Cited by: §3.3.
  • [28] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §2.4, §3.2, §3.2.
  • [29] G. Puy, A. Boulch, and R. Marlet (2020) FLOT: scene flow on point clouds guided by optimal transport. arXiv preprint arXiv:2007.11142. Cited by: §2.3.
  • [30] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §2.1.
  • [31] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++ deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5105–5114. Cited by: §2.1.
  • [32] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black (2018) Generating 3d faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 704–720. Cited by: §2.1.
  • [33] R. Sinkhorn (1967) Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly 74 (4), pp. 402–405. Cited by: §A.2, §3.1.
  • [34] Z. Su, Y. Wang, R. Shi, W. Zeng, J. Sun, F. Luo, and X. Gu (2015) Optimal mass transport for shape matching and comparison. IEEE transactions on pattern analysis and machine intelligence 37 (11), pp. 2246–2259. Cited by: §2.3.
  • [35] R. W. Sumner and J. Popović (2004) Deformation transfer for triangle meshes. ACM Transactions on graphics (TOG) 23 (3), pp. 399–405. Cited by: §B.2, §B.4, Table 4, §1, §2.2, §4.1, Table 1.
  • [36] Q. Tan, L. Gao, Y. Lai, and S. Xia (2018) Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5841–5850. Cited by: §2.1.
  • [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.4.
  • [38] J. Wang, C. Wen, Y. Fu, H. Lin, T. Zou, X. Xue, and Y. Zhang (2020) Neural pose transfer by spatially adaptive instance normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5831–5839. Cited by: Figure 10, §B.2, §B.4, Table 4, §1, §2.2, §2.4, §3.1, §3.2, §3.2, §3.3, §3.3, §4, §4, §4.1, §4.2, Table 1, Table 2.
  • [39] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–67. Cited by: §3.3.
  • [40] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool (2019) Sliced wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3713–3722. Cited by: §2.3.
  • [41] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §2.1.
  • [42] J. Yang, L. Gao, Y. Lai, P. L. Rosin, and S. Xia (2018) Biharmonic deformation transfer with automatic key point selection. Graphical Models 98, pp. 1–13. Cited by: §1, §2.2.
  • [43] W. Yifan, N. Aigerman, V. G. Kim, S. Chaudhuri, and O. Sorkine-Hornung (2020) Neural cages for detail-preserving 3d deformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 75–83. Cited by: §2.2.
  • [44] B. Zhang, M. He, J. Liao, P. V. Sander, L. Yuan, A. Bermak, and D. Chen (2019) Deep exemplar-based video colorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8052–8061. Cited by: §3.1.
  • [45] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen (2020) Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5143–5153. Cited by: §2.3, §3.1, §3.2.
  • [46] K. Zhou, B. L. Bhatnagar, and G. Pons-Moll (2020) Unsupervised shape and pose disentanglement for 3d meshes. In European Conference on Computer Vision, pp. 341–357. Cited by: §2.1.
  • [47] S. Zuffi, A. Kanazawa, D. W. Jacobs, and M. J. Black (2017) 3D menagerie: modeling the 3d shape and pose of animals. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6365–6373. Cited by: Figure 9, §B.1, Table 5, Figure 1, Figure 4, §4, §4.1, Table 1.

Appendix A More details of 3D-CoreNet

A.1 Network architecture

The detailed architecture of our correspondence-refinement network (3D-CoreNet) is shown in Table 3. We take the vertices of the identity and pose meshes as inputs. Both are fed into the feature extractor and the adaptive feature block. The feature extractor consists of three Conv1d-InstanceNorm-LeakyReLU blocks. We then calculate the optimal matching matrix from their features by solving an optimal transport problem, and warp the pose mesh into the coarse warped mesh with the matrix. Finally, the warped mesh is refined in the mesh refinement module with a set of elastic instance normalization residual blocks, whose modulation parameters are learned with elastic instance normalization.

The design of our elastic instance normalization (ElaIN) is shown in Figure 7. At first, we normalize the features of the warped mesh $F_{warp}$ with instance normalization and obtain the mean $\mu$ and standard deviation $\sigma$. Then, the features of the identity mesh are fed into a simple convolution layer to get $F'_{id}$, which shares the same size as $F_{warp}$. We adopt global average pooling to pool $F_{warp}$ and $F'_{id}$ and concatenate them in the channel dimension. A fully-connected layer is employed to compute an adaptive weight $w$. We blend $\mu$, $\sigma$ and $\beta_{id}$, $\gamma_{id}$ elastically with $w$ to get the modulation parameters $\beta$ and $\gamma$, where $\gamma_{id}$ and $\beta_{id}$ are learned from $F'_{id}$ with two convolution layers. Finally, we scale the normalized $F_{warp}$ with $\gamma$ and shift it with $\beta$.

Figure 7: The detailed design of our elastic instance normalization. We normalize the features of the warped mesh $F_{warp}$ with InstanceNorm and obtain the mean $\mu$ and standard deviation $\sigma$. Then, the features of the identity mesh are fed into a convolution layer to get $F'_{id}$, which shares the same size as $F_{warp}$. We adopt global average pooling to pool $F_{warp}$ and $F'_{id}$ and concatenate them in the channel dimension. A fully-connected layer computes an adaptive weight $w$. We blend $\mu$, $\sigma$ and $\beta_{id}$, $\gamma_{id}$ elastically with $w$ to get $\beta$ and $\gamma$; $\gamma_{id}$ and $\beta_{id}$ are learned from $F'_{id}$. Finally, we scale the normalized $F_{warp}$ with $\gamma$ and shift it with $\beta$. The value on the parameter flow denotes the weight.
Module                    Layers
Correspondence learning   Feature extractor: Conv1d, Conv1d, Conv1d
                          Adaptive feature block: Resblock ×4, Conv1d
                          Optimal transport: matching matrix
                          Warping: warped mesh
Mesh refinement           Refinement: Conv1d, Conv1d,
                          ElaIN Resblock, Conv1d,
                          ElaIN Resblock, Conv1d,
                          ElaIN Resblock, Conv1d
Table 3: The network architecture of 3D-CoreNet. We give an example when training on SMPL [24], where the number of vertices is 6890.

A.2 Solving the OT problem with the Sinkhorn algorithm

In this section, we solve the optimal transport (OT) problem defined in Section 3.1 with the Sinkhorn algorithm [33]. Following [10], we introduce an entropic regularization term to solve the OT problem efficiently,

$$\tilde{T} = \underset{T}{\arg\min} \sum_{i,j} T(i,j)\, C_{cost}(i,j) - \epsilon H(T), \quad \text{s.t.} \quad T\,\mathbf{1}_{N_{pose}} = \frac{1}{N_{id}}\,\mathbf{1}_{N_{id}}, \quad T^{\top}\mathbf{1}_{N_{id}} = \frac{1}{N_{pose}}\,\mathbf{1}_{N_{pose}}, \qquad (11)$$

where $H(T) = -\sum_{i,j} T(i,j) \log T(i,j)$ is the entropy of $T$; $T$, $C_{cost}$ and $\tilde{T}$ are the transport matrix, cost matrix and optimal matching matrix respectively; $\mathbf{1}_{N_{id}}$ and $\mathbf{1}_{N_{pose}}$ are vectors whose elements are all 1; and $\epsilon$ is the regularization parameter. The details of the solving process are shown in Algorithm 1.

Input:  cost matrix C_cost, regularization parameter ε, iteration number t_max.
Output: optimal matching matrix T̃.
  K = exp(−C_cost / ε);
  b⁰ = 1;
  for t = 1, …, t_max do
     aᵗ = μ ⊘ (K bᵗ⁻¹);
     bᵗ = ν ⊘ (Kᵀ aᵗ);
  end for
  T̃ = diag(a) K diag(b)
Algorithm 1: Optimal transport with the Sinkhorn algorithm, where μ = (1/N_id)·1 and ν = (1/N_pose)·1 are the uniform marginals from Eq. 2 and ⊘ denotes element-wise division.
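For concreteness, a PyTorch sketch of Algorithm 1; the uniform marginals follow Eq. 2, while the tensor names and the default hyperparameter values are our assumptions:

```python
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """Entropy-regularized OT via Sinkhorn iterations (in the spirit of [10, 33]).

    cost: (N_id, N_pose) cost matrix, e.g. 1 - correlation.
    Returns the matching matrix T with uniform row/column marginals.
    Defaults for eps and iters are illustrative, not the paper's values.
    """
    n_id, n_pose = cost.shape
    mu = torch.full((n_id,), 1.0 / n_id)       # mass on identity vertices
    nu = torch.full((n_pose,), 1.0 / n_pose)   # mass on pose vertices
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    b = torch.ones(n_pose)
    for _ in range(iters):
        a = mu / (K @ b)                       # enforce row marginals
        b = nu / (K.t() @ a)                   # enforce column marginals
    return a[:, None] * K * b[None, :]         # diag(a) K diag(b)
```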

A.3 More implementation details

In Algorithm 1, the regularization parameter ε and the iteration number t_max are fixed hyperparameters, and the ε in Eq. 5 is a small constant for numerical stability. We train our model on one RTX 3090 GPU. The training time is about 24 hours on the human data and about 36 hours on the animal data.

Appendix B More experimental results

B.1 More results on the human and animal data

In Figure 8 and Figure 9, we show more results generated by our 3D-CoreNet on SMPL [24] and SMAL [47] respectively.

Figure 8: More results generated by 3D-CoreNet on SMPL [24].
Figure 9: More results generated by 3D-CoreNet on SMAL [47].

B.2 Generalization capability

In Figure 10, we show more results generated by 3D-CoreNet on FAUST [5] and MG-Dataset [4]. To further test the generalization capability of our model, we compare it with Wang et al. [38]. Since DT [35] needs reference meshes as additional inputs and no reference meshes are available for the new datasets, we do not compare with DT in this section. As we can see, the results generated by our method are smoother and more realistic than those of Wang et al., whose results often show artifacts on the arms.

Figure 10: More results generated by 3D-CoreNet on FAUST [5] and MG-Dataset [4]. The identity meshes are from FAUST and MG-Dataset. We compare our method with Wang et al. [38] to test the generalization capability of our 3D-CoreNet.

B.3 Robustness to noise

To test the robustness of our model, we add noise to the vertex coordinates of the pose mesh. As shown in Figure 11, our model still produces high-quality results.

Figure 11: Robustness to noise. Here, we add noise to the pose mesh. Our model can still produce high-quality results.

B.4 Average inference times

In this section, we compare the average per-transfer inference times of the different methods under the same experimental settings. As shown in Table 4, the traditional deformation transfer method [35] takes the longest time compared to the deep learning-based methods. The method of [38] does not learn the correspondence between meshes, so it has the shortest inference time, but its generation performance is degraded. 3D-CoreNet achieves notable improvements in generating high-quality results while keeping the inference time acceptable. The fourth and fifth columns show that solving the optimal transport problem takes very little extra time while improving the generation results.

Method   DT [35]   Wang et al. [38]   3D-CoreNet (corr.)   3D-CoreNet (OT)
Time     3.3352s   0.0068s            0.0124s              0.0131s
Table 4: Average inference times of different methods. 3D-CoreNet (corr.) denotes 3D-CoreNet with the correlation matrix and 3D-CoreNet (OT) denotes 3D-CoreNet with the optimal matching matrix.

B.5 Limitations

Although our method produces satisfactory results in most cases and outperforms previous works, there are still some limitations to be addressed in the future.

For example, when testing on the animal data shown in Figure 9, the tails of the animals are difficult to handle properly (the third and sixth rows). When evaluating our model on new identity meshes in Figure 10, if the generated mesh reveals parts of the human body that were not exposed in the original identity mesh, such as the underarms, it produces some artifacts (the fourth row).

Appendix C Licenses of the assets

The licenses of the assets used in this paper are listed in Table 5, with links to the corresponding license pages.

Data             License website
SMPL [24]        https://smpl.is.tue.mpg.de/modellicense
SMAL [47]        https://smal.is.tue.mpg.de/license
MG-Dataset [4]   https://github.com/bharat-b7/MultiGarmentNetwork
FAUST [5]        http://faust.is.tue.mpg.de/data_license
Table 5: Licenses of the assets used in this paper.

Appendix D Broader impact

We propose a novel method with potential applications in animated movies and games, generating new poses for existing shapes with less human intervention. Our research can also have a positive impact on the vision and graphics community by inspiring new ideas for generating meshes efficiently. However, the meshes generated by the model could be abused to synthesize fake content, which is the negative aspect of this research. Such issues have already drawn a lot of public attention, and many approaches have been proposed to address them technologically and legally.