Latent feature disentanglement for 3D meshes

by Jake Levinson, et al.

Generative modeling of 3D shapes has become an important problem due to its relevance to many applications across Computer Vision, Graphics, and VR. In this paper we build upon recently introduced 3D mesh-convolutional Variational AutoEncoders, which have shown great promise for learning rich representations of deformable 3D shapes. We introduce a supervised generative 3D mesh model that disentangles the latent shape representation into independent generative factors. Our extensive experimental analysis shows that learning an explicitly disentangled representation can both improve random shape generation and successfully address downstream tasks such as pose and shape transfer, shape-invariant temporal synchronization, and pose-invariant shape matching.




1 Introduction

The ability to generate new 3D shapes is a fundamentally important objective for many applications, especially in Virtual Reality where the availability of large collections of varied 3D assets is necessary to create rich virtual environments. Much of the recent progress in this area has been facilitated by the introduction of new large scale shape datasets such as ShapeNet (Chang et al. (2015)) and Dynamic FAUST (Bogo et al. (2017)), which have made viable the approaches based on data-driven deep learning techniques (see for example Wu et al. (2016); Achlioptas et al. (2018); Litany et al. (2018); Tan et al. (2018)). For many important applications it is not random shape generation that is desired but rather some user-controlled generation: the ability to manipulate an object by its parts (e.g. Borosan et al. (2012); Dubrovina et al. (2019)), or transfer pose characteristics across deformable shape instances (e.g. Sumner and Popović (2004); Gao et al. (2018)). These goals require a generative model that disentangles the underlying factors of variation in the data.

Learning disentangled representations is a well-studied problem in machine learning (Schmidhuber (1992); Bengio et al. (2013); Locatello et al. (2018); Achille and Soatto (2018); Kim and Mnih (2018); Hinton et al. (2011); Chen et al. (2016)). In the most general context, the explanatory or generative factors are a priori unknown, so the goal of disentanglement is to learn latent factors that are mutually independent and that capture maximal variation in the data. Unsupervised approaches, however, make it difficult to control the interpretation of the disentangled factors. Indeed, many natural modes of variation such as shape and color may be highly correlated in training data, even when they describe semantically independent features.

In this work, we introduce a new generative model for 3D shapes that explicitly disentangles the shape representation by its observable generative factors. Our model builds upon Variational AutoEncoders (Kingma and Welling (2014)), which have shown promising results for learning rich representations of deformable 3D mesh categories, e.g. humans and animals (Litany et al. (2018); Tan et al. (2018)). The model is trained on combinations of synthetic and real datasets where the variations of interest can be controlled during mesh generation. This allows us to generate large scale datasets with the necessary supervision (our model knows when training shapes share a latent factor). In addition to a dataset of articulated cylinders, we show results on a large scale dataset of approximately 3M human shapes exhibiting extreme pose and shape variation (following Varol et al. (2017)). [Footnote 1: Although a parametric model is used to generate our training data, our model is agnostic to this (it only sees the 3D meshes) and can thus scale trivially to non-parametric shape datasets.]

Our evaluations include an analysis of the model’s latent disentanglement properties and experiments for several downstream applications: shape and pose transfer, temporal synchronization, and pose-independent shape matching.

Along with disentanglement, we improve the core performance of the basic Mesh VAE by incorporating a distortion-sensitive loss term that promotes more realistic shape generation, and an alternative technique for latent sampling that can overcome overparameterization of latent spaces (since the optimal latent dimensionality is typically unknown). One insight from our experiments is that a disentangled model can outperform a vanilla model with the same base architecture and generative capacity. This validates the hypothesis that disentangled models learn compact, robust representations.

One surprise in our results is that certain training modes lead to models that are disentangled from the generative standpoint, but not for inference – i.e., the latent representation itself is ‘entangled’, but the generator learns to disregard redundant or irrelevant latent information. Our primary model, however, is disentangled for both use cases.

2 Related Work

In contrast to the unsupervised disentangling models discussed above, when learning from visual data the latent factors are often observable and in some way explicitly supervised. Model training may exploit temporal structure (e.g. in videos, Denton and Birodkar (2017); Villegas et al. (2017)), or generation of synthetic data with controlled latent factors (Kulkarni et al. (2015); Worrall et al. (2017); Yang et al. (2015)). Our approach most closely relates to the Inverse Graphics Network (Kulkarni et al. (2015)), which manipulates factors of variation within training mini-batches. This approach requires knowing which generative factors are being varied, but does not require supervision of the explicit parametric transformations as in Worrall et al. (2017) and Yang et al. (2015).

A number of recent works explore learning disentangled generative models (Variational AutoEncoders, Kingma and Welling (2014), Generative Adversarial Networks, Goodfellow et al. (2014)) where the latent representation is decomposed into an observed (potentially interpretable) component and a component for the remaining variability (Kingma et al. (2014); Narayanaswamy et al. (2017); de Bem et al. (2018); Mathieu et al. (2016)). In these approaches, the interpretable latent factors (e.g. class label or human pose) typically require direct supervision with a regression or classification loss.

While there are many generative models for 3D data such as volumes, point clouds, and meshes (Wu et al. (2016); Achlioptas et al. (2018); Tan et al. (2018); Litany et al. (2018)), disentangled models for 3D data, in particular generative ones, remain under-explored. Recently, Dubrovina et al. (2019) learn a part-aware factorized embedding space. Shapes can be generated by manipulating object parts, but the model generates volumetric shapes rather than meshes.

In addition to the works described above, it is important to note disentangled representations have been explored for numerous applications related to image data. Although a full review is out of scope here, to highlight different applications we refer the reader to topics on face images (Liu et al. (2018); Tran et al. (2017); Shu et al. (2017, 2018)), intrinsic image decomposition (Barrow and Tenenbaum (1978); Barron and Malik (2015)), and characteristic transfer across images (e.g. motion, Chan et al. (2018), appearance, Zanfir et al. (2018), and domain, Usman et al. (2019)).

Articulated shape models.

There is a significant body of work in representation learning for deformable articulated 3D shapes, notably of humans and animals. There are several parametric human shape models that capture the intrinsic human shape variation (Anguelov et al. (2005); Allen et al. (2006); Yang et al. (2014); Pishchulin et al. (2017)). Such approaches align a human mesh template to a set of 3D human scans, such as CAESAR (Robinette et al. (1999)), and compute the principal components on mesh vertex displacements or transformation matrices. To represent human shapes in various poses, parametric skeleton-skinning-based approaches and deformation-based approaches have been used. Skinning-based approaches such as SMPL (Loper et al. (2015)) and Allen et al. (2006) compute vertex positions from the body pose using learnt skinning weights. Deformation-based approaches such as SCAPE (Anguelov et al. (2005)), Freifeld and Black (2012), Hasler et al. (2009) and Hirshberg et al. (2012) use various representations of deformations relative to a reference mesh. More recently, rotation-invariant (Gao et al. (2016)) and as-consistent-as-possible (Gao et al. (2017)) deformation features have been used in mesh convolutional neural networks to extract a deformation embedding (Tan et al. (2018)) and perform unpaired shape deformation transfer using a 3D shape CycleGAN (Gao et al. (2018)). In contrast, our work focuses on explicit shape and pose latent feature disentanglement for general articulated meshes. To capture a natural distribution of human poses, several 3D human animation datasets have been collected. SURREAL (Varol et al. (2017)) performs SMPL fits to CAESAR shapes and activities from CMU MoCap (1999) using Loper et al. (2014). Pons-Moll et al. (2015) and Bogo et al. (2017) provide direct scans from humans performing various activities. Finally, there is work on capturing 3D shapes of animals, including parametric deformable models (Cashman and Fitzgibbon (2012); Zuffi et al. (2018)), and part-based representations (Ntouskos et al. (2015)).

3 Generating 3D Shapes

3.1 Variational autoencoding

Variational autoencoders (VAEs) are a widely-used framework for generative modeling. A VAE assumes that data $x$ is jointly distributed with certain latent variables $z$, which are typically given an independent Gaussian prior, $p(z) = \mathcal{N}(0, I)$. To infer $z$ from $x$, we model the posterior distribution $p(z \mid x)$ by an encoder $q(z \mid x)$, which we take to be a neural network. Similarly, we model the likelihood $p(x \mid z)$ by a decoder network, which allows the model to be used generatively. Training a VAE consists of approximately minimizing the KL divergence of the estimated posterior $q(z \mid x)$ from the true posterior $p(z \mid x)$, by maximizing the so-called Evidence Lower-Bound (ELBO). For more on VAE training, see Kingma and Welling (2014).
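As a rough illustration of the two ingredients of VAE training described above, the following NumPy sketch shows reparameterized sampling and a negative ELBO for a diagonal Gaussian posterior and a standard normal prior. The function names and the squared-error reconstruction term are assumptions made for this example; they are not taken from the paper's implementation.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def negative_elbo(x, x_hat, mu, log_var):
    """Assumed squared-error reconstruction term plus the KL term, per example."""
    recon = np.sum((x - x_hat) ** 2, axis=tuple(range(1, x.ndim)))
    return recon + kl_to_standard_normal(mu, log_var)
```

Minimizing this quantity over encoder and decoder parameters corresponds to maximizing the ELBO, up to constants that depend on the assumed likelihood.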

3.2 MeshVAE and the disentangled model

We based our model on the mesh variational autoencoder (MeshVAE) of Litany et al. (2018) (in principle, our contributions could be incorporated into any similar mesh VAE model, e.g. Tan et al. (2018)). The MeshVAE acts on input data consisting of per-vertex features on a mesh, i.e. an input is $x \in \mathbb{R}^{n \times d}$, where there are $n$ vertices and $d$ features (for us, $d = 3$, the vertex coordinates). The model outputs global latent parameters $z$. The architecture relies crucially on the mesh topology and is entirely convolutional, except for a single initial (fully-connected) decoding layer mapping the latent encoding to a set of per-vertex hidden features.

The architecture is as follows:

  1. The encoder uses feature-steered mesh convolutions (FeaStNet, see Verma et al. (2018)), followed by mean-pooling along vertices. We model the posterior $q(z \mid x)$ as a Gaussian, so that the encoder gives a latent mean and variance $(\mu, \sigma^2)$ in $\mathbb{R}^{k_s + k_p}$, where $k_s$, $k_p$ correspond to the number of shape and pose features.

  2. For VAE training, we sample a latent feature $z \sim \mathcal{N}(\mu, \sigma^2)$, consisting of a shape feature $z_s$ and a pose feature $z_p$. At inference, we simply use $z = \mu$.

  3. The decoder generates per-vertex hidden features from one fully-connected layer, then applies a sequence of FeaStNet convolutional layers.

3.3 MeshVAE-D: Training for disentanglement

A baseline MeshVAE produces an ‘entangled’ latent encoding, which affords little or no control in shape generation. The goal for the disentangled model, MeshVAE-D, is for the latent features to capture shape and pose separately, and we took three steps to this end.

Batching. We structured the training set (SMPL) into doubly-supervised training batches, allowing us to train the model while fully supervising the desired factors of variation. We first structured the dataset into pairs of meshes, with each pair having either the same underlying body shape (i.e. subject identity) or the same pose (cf. the supervision techniques in Kulkarni et al. (2015); Worrall et al. (2017); Yang et al. (2015)). Each training batch then consisted of shape-constant or pose-constant pairs of meshes. For Faust shapes (Bogo et al. (2017)), pose labels are not available, so we only used shape-constant batches. Notably, despite only having access to partial supervision on Faust, the trained model successfully extracts pose and shape features from Faust meshes and is able to conduct pose transfer (see Fig. 7).

Clamping. For a pair of meshes from a training batch, the encoder produces latent features $z = (z_s, z_p)$ and $z' = (z'_s, z'_p)$. During training, for shape-constant pairs, we replaced the latent shape vectors by their joint mean $\bar{z}_s = \tfrac{1}{2}(z_s + z'_s)$ before passing them to the decoder. For pose-constant pairs, we instead clamped the latent pose vectors to $\bar{z}_p = \tfrac{1}{2}(z_p + z'_p)$.

Latent variance loss. We added a loss term equal to the within-batch variance in the clamped latent feature: the variance of $z_s$ for shape-constant pairs and of $z_p$ for pose-constant pairs.

Our clamping approach is similar to that of Kulkarni et al. (2015), which not only averages the latent features but also stops gradients from passing through the clamped neurons. With the latter approach, since the pose encodes much more information than the body shape, it becomes necessary to train with a higher proportion (5-to-1) of shape-constant (i.e., pose-varying) batches. We found that stopping gradients had a mild negative impact on model performance, so our model does not do it.
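The clamping and latent variance steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the latent vector is assumed to store the shape features first and the pose features after them, the variance is taken within each pair and averaged over the batch, and the name `clamp_pairs` is hypothetical.

```python
import numpy as np

def clamp_pairs(z, z_prime, k_shape, shape_constant):
    """Clamp the shared factor of each mesh pair to the pair mean.

    z, z_prime: arrays of shape (batch, k_shape + k_pose), one row per pair member.
    Assumed layout: the first k_shape entries are shape, the rest are pose.
    Returns the clamped latents and the latent variance loss.
    """
    z, z_prime = z.copy(), z_prime.copy()
    # Select the factor that is known to be shared within each pair.
    sl = slice(0, k_shape) if shape_constant else slice(k_shape, None)
    mean = 0.5 * (z[:, sl] + z_prime[:, sl])
    # Variance of the shared factor within each pair, before clamping.
    pair_var = 0.5 * ((z[:, sl] - mean) ** 2 + (z_prime[:, sl] - mean) ** 2)
    z[:, sl] = mean
    z_prime[:, sl] = mean
    var_loss = pair_var.sum(axis=1).mean()
    return z, z_prime, var_loss
```

In an automatic-differentiation framework the same operations would let gradients flow through the mean, which matches the choice described above of not stopping gradients at the clamped features.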

3.4 Loss and regularizers

The VAE training loss consisted of two terms: a reconstruction error, plus the KL divergence of the latent mean and variance from the Gaussian prior. For disentanglement, we included the latent variance loss defined above. As an additional regularization to improve surface smoothness, we introduced a geometric distortion loss based on Pauly et al. (2005, Eq. (3)), computed from the reconstruction displacement at each vertex. The distortion loss penalizes distortion between neighboring vertices, apart from a common translation relative to the base mesh. The resulting meshes are more realistic in terms of both surface texture and fine detail (see Fig. 5); moreover, generated meshes from this model retain smoothness even as the generated shape variation goes beyond the range of shapes seen during training (see Fig. 4). In sum, the loss function was a weighted sum of the four terms above: reconstruction error, KL divergence, latent variance loss and distortion loss.
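To make the idea of a distortion penalty concrete, here is a minimal NumPy sketch of one simple loss with the stated property: it compares the displacements of neighboring vertices, so a common translation of the whole mesh costs nothing. The unweighted form below is an assumption for illustration and is not necessarily the exact expression of Pauly et al. (2005, Eq. (3)), which may include additional weighting.

```python
import numpy as np

def distortion_loss(base, recon, edges):
    """Mean squared difference of displacements across mesh edges.

    base, recon: (n, 3) vertex positions of the base and reconstructed mesh.
    edges: (m, 2) integer array of vertex index pairs that share an edge.
    """
    disp = recon - base                       # per-vertex displacement
    diff = disp[edges[:, 0]] - disp[edges[:, 1]]
    return np.mean(np.sum(diff ** 2, axis=1))
```

A rigid translation of the reconstruction leaves this value at zero, while displacements that differ between adjacent vertices are penalized, which is what encourages locally smooth deformations.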

3.5 Models

We compared our model to the following baselines: (1) an unmodified MeshVAE, (2) a model trained directly to do pose transfer between meshes, and (3) a model based on latent feature permutation during training (MeshVAE-P).

3.6 Transfer model baseline

We trained the Transfer model directly on a pose transfer task. We constructed a dataset of triples taken from SMPL, where the first mesh is the target and the second and third meshes have, respectively, the same pose (but a different subject) and the same subject (but a different pose) as the target. The model is shown the second and third meshes and asked to predict the target. We used an architecture similar to the MeshVAE, with two encoders, one for shape and one for pose, with hidden layer widths set separately for each encoder relative to the full model. The dimensions of the latent space and decoder were left unchanged, and we did not clamp the latent vectors.

3.7 Permute model baseline (MeshVAE-P)

We trained the Permute model without clamping and variance loss, instead permuting latent features of batch pairs during training. That is, we swap the latent shape features $z_s$ and $z'_s$ of the two meshes in a shape-constant batch (or the latent pose features $z_p$ and $z'_p$ in a pose-constant batch) before passing them to the decoder. By construction, the exchanged latent features still describe the same true meshes, so the decoder learns to reconstruct the same output mesh. MeshVAE-P produces decoded meshes of similar quality to MeshVAE-D, but the latent features themselves are poorly disentangled: the shape vector ends up carrying pose information, in fact more pose information than shape information (see Fig. 3), which the decoder learns to ignore (see Fig. 2). This baseline highlights a key distinction between generative disentanglement (possible using MeshVAE-P or MeshVAE-D) and inferential disentanglement (only possible with MeshVAE-D). Indeed, the MeshVAE-P model performs closer to the baseline on an inference task related to shape.

4 Experiments

4.1 Articulated cylinders

We first trained our model on a toy dataset consisting of meshes shaped as cylinders with a single bend angle (1 pose parameter) and varying arm lengths and radius (3 shape parameters); see Fig. 1. For train/test splits, we held out a range of values for each parameter (see appendix).

                     pose      shape 1   shape 2   shape 3
pose (est.)          0.9628   -0.0146   -0.0065    0.0215
shape 1 (est.)      -0.0192    0.9992    -         -
shape 2 (est.)      -0.0047    -         0.9953    -
shape 3 (est.)      -0.0034    -         -         0.9801

Figure 1: Left: An articulated cylinder. Right: Pearson's correlation between pose and shape parameters for ground truth (columns) and MeshVAE-D reconstructions (rows, using estimated parameters). Cross-correlations for pairs of shape features are omitted (shown as -).

The resulting models successfully disentangle the cylinder shape from the pose angle. We performed pose and shape exchanges by swapping latent features between cylinders with different shape and pose parameters, then recovered the cylinder parameters using a least squares fit of cylinders to corresponding vertex positions of the decoded shapes. We computed Pearson's correlation between the ground truth and estimated parameters, finding successful semantic disentanglement, with strong correlations between matching parameters and weak correlations across different parameters.
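The correlation analysis can be sketched as follows: given ground-truth and estimated parameters for a set of decoded cylinders, compute the Pearson correlation between every estimated parameter and every ground-truth parameter. The function name and array layout are assumptions for this example.

```python
import numpy as np

def cross_correlation(true_params, est_params):
    """Pearson correlation of estimated (rows) vs. ground-truth (columns) parameters.

    Both inputs have shape (num_samples, num_params).
    """
    t = (true_params - true_params.mean(axis=0)) / true_params.std(axis=0)
    e = (est_params - est_params.mean(axis=0)) / est_params.std(axis=0)
    # The mean product of standardized values is the Pearson coefficient.
    return e.T @ t / len(t)
```

In a well-disentangled model the diagonal of this matrix is close to 1 and the off-diagonal pose/shape entries are close to 0.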

4.2 Human shapes

Next, we trained a model on a combined dataset of human shapes. We combined shapes from the parametric mesh dataset SMPL and shapes from Faust, which consists of motion-captured meshes, labeled by subject identity but not pose. Training batches consisted of 8 pairs of meshes, alternating between SMPL and Faust in a ratio of 5:5:1 (SMPL shape-constant batches : SMPL pose-constant batches : Faust batches). Note that Faust batches are always shape-constant because Faust does not have pose labels.

For train/test splits, we held out two subjects and one activity from Faust; for SMPL, we held out 100 pose sequences and all subjects whose leading four shape parameters fell within distance of the points ; the SMPL shape distribution overall was sampled uniformly from to in each parameter. This is a much broader distribution than used by SURREAL (Varol et al. (2017)) which samples from the unit Gaussian. This was necessary to generate a dataset with extreme variations in human shape. We will provide all the sampled shape parameters and details so the test and train datasets can be reproduced exactly.

For direct reconstruction of input meshes, the model shows improved vertex error relative to a baseline Mesh VAE (Fig. 2) and compared to a model trained directly on pose transfer.

Model                  Reconstruction MVE (cm)
MeshVAE                3.6
MeshVAE-D (Ours)       2.7
MeshVAE-P (Permute)    2.8
Transfer model         3.7

Model                  Pose transfer MVE (cm)
MeshVAE                n/a
MeshVAE-D (Ours)
MeshVAE-P (Permute)
Transfer model

Figure 2: Left: Mean vertex error (MVE) for direct mesh reconstruction on SMPL test set. Right: Mean vertex error across 500 examples in the pose transfer experiment.

4.3 Encoder disentanglement

We first assessed the quality of the latent encoding itself. We stored the latent shape and pose encodings for all meshes, then calculated the distance between latent pose encodings for: (a) pairs of meshes in the same pose, (b) pairs of meshes of the same shape, and (c) random pairs of meshes.

In a perfectly disentangled model, the distances (a) should be zero, while (b) and (c) should have similar distributions of latent distances. In practice, we instead observe for MeshVAE-D. We then repeated the calculation with latent shape encodings, where we expect the reverse, i.e. . The distribution of distances is shown in the histograms in Fig. 3, showing good disentanglement for our clamped model MeshVAE-D. By contrast, MeshVAE-P, trained by permuting rather than clamping the latent features, is poorly disentangled: the latent shape encoding is more responsive to pose than to shape! In particular, latent shape proximity in this model is more indicative of pose alignment () than shape similarity.

Figure 3: Distribution of latent distance between encodings of pairs of meshes with either the same shape, the same pose, or neither (random pairs). Distance in is scaled by .

4.4 Decoder disentanglement

To assess disentanglement on the level of the decoder, we attempted to generate shapes while holding one or the other feature fixed.

For fixed-pose, variable-shape generation, we took a pose encoding from a fixed real mesh and generated shape encodings where was the observed latent scale across the dataset. For random pose generation, since our model overparametrizes the true pose, the latent pose distribution does not, in practice, occupy the entire latent (pose) space – making it a challenge to generate suitable random poses. To get around this problem, we examined the model’s latent pose distribution using a modified PCA, computed at inference from the pose encodings of 120k random training meshes. The top 80 principal components account for of the latent pose distribution variance. We then generated Gaussian random pose vectors
within these principal axes (weighted by the singular values) and combined them with a fixed shape encoding from a fixed mesh from the test set. See Fig. 4. A more detailed analysis of performance relative to latent dimensions is in the appendix.
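The sampling procedure can be illustrated with ordinary PCA computed by SVD. Note the assumptions: the paper describes a modified PCA whose details are not given here, so plain PCA is substituted, and the function names are hypothetical.

```python
import numpy as np

def fit_pose_pca(pose_codes, num_components):
    """Principal axes and per-axis standard deviations of latent pose codes.

    pose_codes: (num_meshes, k_pose) array of encoded poses.
    """
    mean = pose_codes.mean(axis=0)
    _, s, vt = np.linalg.svd(pose_codes - mean, full_matrices=False)
    # Singular values rescaled to standard deviations along each axis.
    scale = s[:num_components] / np.sqrt(len(pose_codes) - 1)
    return mean, vt[:num_components], scale

def sample_pose_codes(mean, axes, scale, num_samples, rng):
    """Gaussian samples restricted to the span of the principal axes."""
    coeffs = rng.standard_normal((num_samples, len(scale))) * scale
    return mean + coeffs @ axes
```

Each sampled pose code would then be concatenated with a fixed shape encoding and passed to the decoder. Restricting samples to the leading axes keeps them inside the region of latent space that the encoder actually uses.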

Note that the level of variation (particularly in body shape) in the generated meshes goes beyond that of the training set. We view these results as compound benefits of having both a disentangled and geometrically-based model: we are able to vary the mesh in a controlled way, while our geometric priors, such as the distortion term (1), ensure that variation in vertex predictions is smoothed out locally to form plausible mesh deformations, rather than just degrading the mesh (see Fig. 5 right).

Figure 4: Meshes from Gaussian random shape vectors and fixed real pose encoding (left column) or vice versa (right column). Our method can generate plausible shapes well outside the SMPL distribution.
Model                  top-1   top-2   top-3
MeshVAE                31.8    39.4    43.8
MeshVAE-D (Ours)       47.8    57.6    63.2
MeshVAE-P (Permute)    46.0    55.3    59.9

Figure 5: Left: Top-k score (%) on shape recognition task. Right: Impact of the distortion loss term on reconstructed mesh quality. Left is MeshVAE-D, right standard MeshVAE.

4.5 Pose and shape transfer

A primary application of a disentangled model is to transfer poses and body shapes from one mesh to another. We used the dataset of triples constructed for the pose transfer baseline (see Sec. 3.6). While the baseline model is trained directly on this task, the disentangled model instead produces the target by combining the appropriate latent attributes from the two secondary meshes: the pose feature of the mesh that shares the target's pose and the shape feature of the mesh that shares the target's subject. See Fig. 7.
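In code, transfer with a disentangled encoder amounts to recombining two latent vectors. The sketch below assumes hypothetical `encode` and `decode` callables that map a mesh to a flat latent vector and back, with the shape features stored in the first `k_shape` entries; none of these names come from the paper.

```python
import numpy as np

def transfer(encode, decode, shape_source, pose_source, k_shape):
    """Decode a mesh with the shape of one input and the pose of another."""
    z_shape = encode(shape_source)[:k_shape]   # shape part of the first mesh
    z_pose = encode(pose_source)[k_shape:]     # pose part of the second mesh
    return decode(np.concatenate([z_shape, z_pose]))
```

No transfer-specific training is involved: the encoder and decoder are used as trained, and only the latent features are exchanged.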

To ensure that the task presented a challenge for the model, we tested on a subset of triples, requiring the secondary meshes to have extrinsic mean vertex distance from of at least cm for and cm for . Surprisingly, the primary model outperformed the model trained directly on triples (Fig. 2).

4.6 Pose synchronization on Faust

Next, we evaluated the model on a pose synchronization task using dynamic time warping (DTW, Sakoe and Chiba (1978)). Given two sequences $x_1, \dots, x_m$ and $y_1, \dots, y_n$ and costs (energies) $c(x_i, y_j)$, DTW produces a sequence of pairs $(i_k, j_k)$ such that $i_k \le i_{k+1}$, $j_k \le j_{k+1}$, and every $x_i$ is matched to at least one $y_j$ and vice versa, minimizing the total energy. We performed the synchronization task using a cost given by the distance between latent pose encodings for $x_i$ and $y_j$. This task is especially interesting on Faust because the sequences consist of similar motions (jumping jacks, running on the spot, and so on) but are not synchronized and do not have pose labels. See Fig. 6.
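A compact dynamic-programming version of DTW over latent pose encodings might look like the following. It is a textbook formulation with the standard three-way step pattern, written for clarity rather than speed; the function names and the Euclidean cost are assumptions for this example.

```python
import numpy as np

def pose_cost(pose_a, pose_b):
    """Pairwise Euclidean distances between two sequences of pose encodings."""
    return np.linalg.norm(pose_a[:, None, :] - pose_b[None, :, :], axis=-1)

def dtw_path(cost):
    """Return the minimum-energy monotone alignment and its total energy."""
    m, n = cost.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    # Backtrack from the last pair to the first, following the cheapest predecessor.
    i, j = m, n
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1], acc[m, n]
```

The returned path lists zero-based frame index pairs. Its length lies between max(m, n) and m + n - 1, because each step advances at least one of the two sequences.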

4.7 Shape recognition on SMPL

As a complementary experiment to pose synchronization, we performed a shape recognition task: we selected a set of subjects at random from the test set, and for each subject chose two random meshes. We then had the model predict, for each first mesh, which of the second meshes (the counterparts) comes from the same subject. We used nearest-neighbors assignment based on the shape encoding only (MeshVAE-D, MeshVAE-P) or the overall latent encoding (MeshVAE). Our model significantly outperforms the baseline on this task (Fig. 5), correctly identifying nearly half the meshes (top-1 score) and roughly 2/3 (top-3 score). We also observe a moderate improvement over MeshVAE-P, for which the latent representation is only partially disentangled.
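The top-k score for this matching task can be computed as in the sketch below, which assumes that row i of each array holds the latent encoding of one of the two meshes of subject i. The function name and the Euclidean metric are assumptions for this example.

```python
import numpy as np

def top_k_accuracy(queries, gallery, k):
    """Fraction of queries whose true counterpart is among the k nearest gallery items.

    queries, gallery: (num_subjects, latent_dim); row i of each belongs to subject i.
    """
    dists = np.linalg.norm(queries[:, None, :] - gallery[None, :, :], axis=-1)
    ranked = np.argsort(dists, axis=1)[:, :k]          # k nearest gallery indices
    hits = (ranked == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()
```

For the disentangled models only the shape part of the latent vector would be passed in, so that differences in pose between the two meshes of a subject do not affect the ranking.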

4.8 Latent pose interpolation

To explore the local structure of the model's latent space, we took motion sequences and conducted pose interpolation between nearby frames, using linear interpolation in the latent space. The interpolation results in additional semantic detail, such as arm-bending between two poses with almost straight arms. See Fig. 7 for a comparison with naive extrinsic linear interpolation of vertex coordinates in $\mathbb{R}^3$, which causes unrealistic mesh deformations.
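Linear interpolation in latent space is a one-line operation; the sketch below generates the intermediate latent vectors, each of which would then be passed through the decoder. The function name is hypothetical.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Evenly spaced latent vectors from z_start to z_end, endpoints included."""
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end
```

Applying the same formula directly to vertex coordinates gives the extrinsic baseline, which moves every vertex along a straight line and can therefore shrink or shear limbs between two poses.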

Figure 6: Pose synchronization based on the latent pose encoding. Top: Selected frames from original motion sequences (580 and 1250 frames). Bottom: Selected frames from dynamically synchronized sequences (1251 frames).
Figure 7: Left: Transfer experiment, with columns showing the target, the swap result and the direct reconstruction. The model is given the two secondary meshes and combines their latent features to predict the target. The rightmost column shows, for comparison, the result of directly encoding and decoding the target itself. The third row shows an example from Faust, which does not have ground truth for pose swapping (n/a). Right: Interpolation experiment. Comparison of linear interpolation in latent feature space (top) with extrinsic linear interpolation in $\mathbb{R}^3$ (bottom).

5 Conclusion and Future Work

This paper introduces a disentangled mesh-convolutional VAE. With careful consideration of the supervision and training design, we see that our proposed model can achieve accurate disentanglement while capturing the varied pose and shape properties in large-scale mesh datasets.

Given these promising results, in future work we will explore two directions: (1) improving the model by incorporating techniques from alternate generative approaches (e.g. VAE-GAN) into our current design, and (2) domain transfer to 3D data captured in the wild (e.g. captured with commodity RGBD sensors).


  • Achille and Soatto [2018] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018.
  • Achlioptas et al. [2018] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3D point clouds. In International Conference on Machine Learning, (ICML), volume 80, pages 40–49, 2018.
  • Allen et al. [2006] B. Allen, B. Curless, Z. Popović, and A. Hertzmann. Learning a correlated model of identity and pose-dependent body shape variation for real-time synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’06, pages 147–156, Aire-la-Ville, Switzerland, 2006. Eurographics Association. ISBN 3-905673-34-7.
  • Anguelov et al. [2005] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: Shape completion and animation of people. In ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, pages 408–416, New York, NY, USA, 2005. ACM. doi: 10.1145/1186822.1073207.
  • Barron and Malik [2015] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell., 37(8):1670–1687, 2015.
  • Barrow and Tenenbaum [1978] H. G. Barrow and J. M. Tenenbaum. Recovering intrinsic scene characteristics from images. In Computer Vision Systems, 1978.
  • Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
  • Bogo et al. [2017] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic FAUST: Registering human bodies in motion. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • Borosan et al. [2012] P. Borosan, M. Jin, D. DeCarlo, Y. Gingold, and A. Nealen. RigMesh: Automatic rigging for part-based shape modeling and deformation. ACM Transactions on Graphics (TOG), 31(6):198:1–198:9, 2012.
  • Cashman and Fitzgibbon [2012] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3d morphable models from 2d images. IEEE transactions on pattern analysis and machine intelligence, 35(1):232–244, 2012.
  • Chan et al. [2018] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. CoRR, abs/1808.07371, 2018.
  • Chang et al. [2015] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
  • Chen et al. [2016] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems, pages 2180–2188, 2016.
  • CMU MoCap [1999] CMU MoCap. Carnegie-mellon mocap database, 1999.
  • de Bem et al. [2018] R. de Bem, A. Ghosh, T. Ajanthan, O. Miksik, N. Siddharth, and P. H. Torr. Dgpose: Disentangled semi-supervised deep generative models for human body analysis. arXiv preprint arXiv:1804.06364, 2018.
  • Denton and Birodkar [2017] E. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.
  • Dubrovina et al. [2019] A. Dubrovina, F. Xia, P. Achlioptas, M. Shalah, and L. J. Guibas. Composite shape modeling via latent space factorization. CoRR, abs/1901.02968, 2019.
  • Freifeld and Black [2012] O. Freifeld and M. J. Black. Lie bodies: A manifold representation of 3d human shape. In European Conference on Computer Vision, pages 1–14. Springer, 2012.
  • Gao et al. [2016] L. Gao, Y.-K. Lai, D. Liang, S.-Y. Chen, and S. Xia. Efficient and flexible deformation representation for data-driven surface modeling. ACM Transactions on Graphics (TOG), 35(5):158, 2016.
  • Gao et al. [2017] L. Gao, Y.-K. Lai, J. Yang, L.-X. Zhang, L. Kobbelt, and S. Xia. Sparse data driven mesh deformation. arXiv preprint arXiv:1709.01250, 2017.
  • Gao et al. [2018] L. Gao, J. Yang, Y.-L. Qiao, Y.-K. Lai, P. L. Rosin, W. Xu, and S. Xia. Automatic unpaired shape deformation transfer. ACM Trans. Graph., 37(6):237:1–237:15, Dec. 2018. ISSN 0730-0301. doi: 10.1145/3272127.3275028.
  • Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Hasler et al. [2009] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel. A statistical model of human pose and body shape. In Computer graphics forum, volume 28, pages 337–346. Wiley Online Library, 2009.
  • Hinton et al. [2011] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN), pages 44–51, 2011.
  • Hirshberg et al. [2012] D. A. Hirshberg, M. Loper, E. Rachlin, and M. J. Black. Coregistration: Simultaneous alignment and modeling of articulated 3d shape. In European Conference on Computer Vision, pages 242–255. Springer, 2012.
  • Kim and Mnih [2018] H. Kim and A. Mnih. Disentangling by factorising. In International Conference on Machine Learning, pages 2649–2658, 2018.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
  • Kingma et al. [2014] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
  • Kulkarni et al. [2015] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
  • Litany et al. [2018] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1886–1895, 2018.
  • Liu et al. [2018] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In CVPR, 2018.
  • Locatello et al. [2018] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. CoRR, abs/1811.12359, 2018.
  • Loper et al. [2015] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
  • Loper et al. [2014] M. M. Loper, N. Mahmood, and M. J. Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 33(6):220:1–220:13, Nov. 2014.
  • Mathieu et al. [2016] M. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5041–5049, 2016.
  • Narayanaswamy et al. [2017] S. Narayanaswamy, T. B. Paige, J.-W. van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pages 5925–5935, 2017.
  • Ntouskos et al. [2015] V. Ntouskos, M. Sanzari, B. Cafaro, F. Nardi, F. Natola, F. Pirri, and M. Ruiz. Component-wise modeling of articulated objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 2327–2335, 2015.
  • Pauly et al. [2005] M. Pauly, N. J. Mitra, J. Giesen, M. Gross, and L. J. Guibas. Example-based 3d scan completion. In Proceedings of the Third Eurographics Symposium on Geometry Processing, SGP ’05, Aire-la-Ville, Switzerland, 2005. Eurographics Association. ISBN 3-905673-24-X.
  • Pishchulin et al. [2017] L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele. Building statistical shape spaces for 3d human modeling. Pattern Recognition, 2017.
  • Pons-Moll et al. [2015] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics, (Proc. SIGGRAPH), 34(4):120:1–120:14, Aug. 2015.
  • Robinette et al. [1999] K. Robinette, H. Daanen, and E. Paquet. The caesar project: a 3-d surface anthropometry survey. In 3D Imaging and Modelling, pages 380 – 386, 1999.
  • Sakoe and Chiba [1978] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26:43–49, 1978.
  • Schmidhuber [1992] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4:863–879, 1992.
  • Shu et al. [2017] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Shu et al. [2018] Z. Shu, M. Sahasrabudhe, R. A. Güler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance. In The European Conference on Computer Vision, (ECCV), 2018.
  • Sumner and Popović [2004] R. W. Sumner and J. Popović. Deformation transfer for triangle meshes. ACM Trans. Graph., 23(3):399–405, 2004.
  • Tan et al. [2018] Q. Tan, L. Gao, Y. Lai, and S. Xia. Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2018.
  • Tran et al. [2017] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of IEEE Computer Vision and Pattern Recognition, Honolulu, HI, July 2017.
  • Usman et al. [2019] B. Usman, N. Dufour, K. Saenko, and C. Bregler. Puppetgan: Transferring disentangled properties from synthetic to real images. CoRR, abs/1901.10024, 2019.
  • Varol et al. [2017] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.
  • Verma et al. [2018] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2598–2606, 2018.
  • Villegas et al. [2017] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations, 2017.
  • Worrall et al. [2017] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Interpretable transformations with encoder-decoder networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • Wu et al. [2016] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • Yang et al. [2015] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
  • Yang et al. [2014] Y. Yang, Y. Yu, Y. Zhou, S. Du, J. Davis, and R. Yang. Semantic parametric reshaping of human body models. In 2014 2nd International Conference on 3D Vision, volume 2, pages 41–48. IEEE, 2014.
  • Zanfir et al. [2018] M. Zanfir, A.-I. Popa, A. Zanfir, and C. Sminchisescu. Human appearance transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Zuffi et al. [2018] S. Zuffi, A. Kanazawa, and M. J. Black. Lions and tigers and bears: Capturing non-rigid, 3D, articulated shape from images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2018.

Appendix A Appendix

A.1 Network architecture

Our network architecture used the hidden layer widths indicated in Fig. S8. All models were trained in TensorFlow using the Adam optimizer, with a learning rate decaying exponentially from to over 400k steps. The articulated cylinder models typically converged within 100k steps.
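
As a rough illustration of the schedule described above, the following sketch interpolates the learning rate geometrically between two endpoints. The endpoint values `LR_START` and `LR_END` are placeholders chosen for the example, not the values used for the models in this paper, and the function name `exponential_decay` is likewise hypothetical.

```python
def exponential_decay(step, lr_start, lr_end, total_steps):
    """Learning rate that decays geometrically from lr_start to lr_end."""
    # Fraction of the schedule completed, clamped to [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** frac

# Placeholder endpoints (assumptions for illustration only).
LR_START, LR_END, TOTAL_STEPS = 1e-3, 1e-5, 400_000

for step in (0, 100_000, 200_000, 400_000):
    print(step, exponential_decay(step, LR_START, LR_END, TOTAL_STEPS))
```

With this form, the rate is multiplied by the same factor over every interval of equal length, which is what "exponential decay" usually means in optimizer schedules.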

| Cylinders | Layers / Units |
| --- | --- |
| Encoder | 16 (1x1 conv), 24, 32, 48, 64 (FeaStNet), mean-pooling |
| Latent space | 4N = 3N (shape) + 1N (pose), N = 1, 2, 3 |
| Decoder | (FC), 64, 48, 32, 24, 16 (FeaStNet), 3 (1x1 conv) |

| Human shapes | Layers / Units |
| --- | --- |
| Encoder | 16 (1x1 conv), 32, 64, 96, 128 (FeaStNet), mean-pooling |
| Latent space | 128 = 16 (shape) + 112 (pose) |
| Decoder | (FC), 128, 96, 64, 32, 16 (FeaStNet), 3 (1x1 conv) |
| Discriminator | 16 (1x1 conv), 32, 64 (FeaStNet), mean-pooling, then 1 (FC) |

Figure S8: Network architecture.

A.2 Experiments

A.2.1 Articulated cylinders

We evaluated disentanglement on the articulated cylinders by estimating the explicit shape and pose parameters from the reconstructed meshes (see Section 4.1 of the paper). For the test set, we used holdout ranges of parameters, . For a holdout set of meshes, we compared latent RMSE (see Fig. S9) and Pearson’s correlation (see Fig. S10) for direct reconstructions vs. random shape-pose swaps. The error in the shape latents was almost identical in the two settings, whereas the pose latent exhibited a larger error after the swap.

| Parameter | Latent swap | Direct |
| --- | --- | --- |
| | 10.25375 | 1.5044 |
| | 0.04784 | 0.04711 |
| | 0.06322 | 0.0382 |
| | 0.00545 | 0.0051 |

Figure S9: Left: An articulated cylinder. Right: Root mean squared error of estimated latent parameters compared to ground truth, for cylinders decoded directly or following latent feature swap.
0.99961 -0.01074 -0.01690 -0.00540
-0.00085 0.99973 - -
0.00705 - 0.99919 -
-0.02340 - - 0.98127
Figure S10: Pearson’s correlation between pose () and shape parameters for ground truth (columns) and MeshVAE-D direct reconstructions, i.e., without using pose swapping (rows, using estimated parameters). Cross-correlations for pairs of shape features are omitted. Compare to Fig. 1, which shows correlations using estimated latent parameters after reconstruction using pose swaps.
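
The two statistics reported in Figs. S9 and S10 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the arrays below are synthetic stand-ins with one row per mesh and one column per parameter, and the helper names `rmse_per_parameter` and `cross_correlation` are hypothetical, not taken from the paper's code.

```python
import numpy as np

def rmse_per_parameter(estimated, ground_truth):
    """RMSE of each parameter (column), taken over all meshes (rows)."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(ground_truth, dtype=float)
    return np.sqrt(np.mean(diff ** 2, axis=0))

def cross_correlation(estimated, ground_truth):
    """Pearson correlations: rows are estimated parameters, columns are ground truth."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    k = est.shape[1]
    # corrcoef stacks both sets of columns as variables; keep the est-vs-gt block.
    full = np.corrcoef(est, gt, rowvar=False)
    return full[:k, k:]

# Synthetic stand-in data: 1000 meshes, 4 parameters (1 pose + 3 shape).
rng = np.random.default_rng(0)
gt = rng.uniform(size=(1000, 4))
est = gt + 0.01 * rng.standard_normal(gt.shape)

print(rmse_per_parameter(est, gt))
print(np.round(cross_correlation(est, gt), 3))
```

A well-disentangled model should give a correlation matrix that is close to the identity, as in Fig. S10, with off-diagonal entries between pose and shape near zero.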

A.2.2 Discussion on PCA reduction for latent pose generation for human shapes

Our model overparametrizes the latent pose vector (see Fig. S11), offering the possibility of reducing the latent dimensionality after training has finished. For random pose generation, we found it beneficial to first reduce the latent pose space to 80 latent dimensions by conducting a principal component analysis of the empirical latent distribution and sampling latent vectors from the top principal components. The PCA was based on the latent pose encodings of approximately 120k meshes from the training set; we found similar results using PCA computed from 40k and 220k meshes.
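
The reduction described above can be sketched as follows. The text does not specify how coefficients along the principal directions are drawn, so this sketch assumes independent Gaussian coefficients scaled by the empirical standard deviation of each direction. The encodings are random stand-ins for real latent pose vectors, and `fit_pca` and `sample_pose_latents` are hypothetical helper names.

```python
import numpy as np

def fit_pca(latents):
    """Return the mean, principal directions (rows), and per-direction std."""
    mean = latents.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by singular value.
    _, s, vt = np.linalg.svd(latents - mean, full_matrices=False)
    std = s / np.sqrt(len(latents) - 1)
    return mean, vt, std

def sample_pose_latents(mean, components, std, n_components, n_samples, rng):
    """Draw latent vectors that vary only along the top principal directions."""
    # Assumption: Gaussian coefficients with the empirical std of each direction.
    coeffs = rng.standard_normal((n_samples, n_components)) * std[:n_components]
    return mean + coeffs @ components[:n_components]

rng = np.random.default_rng(0)
# Stand-in for real pose encodings: 2000 random 112-dimensional vectors.
encodings = rng.standard_normal((2000, 112)) * np.linspace(2.0, 0.1, 112)
mean, components, std = fit_pca(encodings)
samples = sample_pose_latents(mean, components, std, n_components=80, n_samples=5, rng=rng)
print(samples.shape)  # (5, 112)
```

The sampled vectors still have 112 entries, so they can be fed to a decoder trained on the full latent pose space; only their variation is restricted to the 80-dimensional principal subspace.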

A natural question is whether this dimensionality reduction amounts to training with fewer latent features. We compared MeshVAE-D with 112 latent pose features (called MeshVAE-D-112) to a smaller MeshVAE-D with 80 latent pose features (called MeshVAE-D-80), and found that the smaller model produced significantly worse reconstructions. Specifically, MeshVAE-D-80 had higher mean vertex error for reconstructed meshes (cm MVE, compared to cm MVE for MeshVAE-D-112), and larger latent embedding variance loss (between meshes with common shape or pose). In contrast, applying the latent PCA projection to MeshVAE-D-112 had a negligible impact on mesh reconstruction, increasing the reconstructed mesh MVE by only cm (compared to MVE=cm for MeshVAE-D-112 without PCA projection). We also note that the PCA on MeshVAE-D-112 transformed the latent pose vectors by of their norm. Thus, training a model with fewer latent features results in larger reconstruction and disentanglement errors than applying PCA reduction to the latent features of a larger model.
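
The quantities compared in this paragraph can be illustrated with a small sketch. It assumes that mean vertex error is the mean Euclidean distance between corresponding vertices, which this appendix does not define explicitly, and it fits the PCA on the same stand-in latents that it projects, whereas the paper fits it on training-set encodings. All function names are hypothetical.

```python
import numpy as np

def mean_vertex_error(pred, target):
    """Mean Euclidean distance between corresponding vertices, shape (..., V, 3)."""
    return float(np.mean(np.linalg.norm(pred - target, axis=-1)))

def pca_project(latents, n_components):
    """Project latents onto the affine subspace of their top principal components."""
    mean = latents.mean(axis=0)
    _, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
    top = vt[:n_components]
    return mean + (latents - mean) @ top.T @ top

def relative_change(latents, projected):
    """Norm of the change per vector, as a fraction of the original norm."""
    return np.linalg.norm(projected - latents, axis=1) / np.linalg.norm(latents, axis=1)

rng = np.random.default_rng(0)
# Stand-in latent pose vectors with decaying variance per dimension.
latents = rng.standard_normal((500, 112)) * np.linspace(2.0, 0.05, 112)
projected = pca_project(latents, n_components=80)
print(relative_change(latents, projected).mean())
```

To reproduce the MVE comparison, one would decode both `latents` and `projected` with the trained decoder (not shown here) and pass the resulting vertex arrays to `mean_vertex_error` against the original meshes.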

Figure S11: Singular value plot for the latent pose distribution of MeshVAE-D, trained on human shapes. PCA estimated from 220k random training meshes. The top 80 principal components account for of the variance.
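
The variance fraction quoted in the caption of Fig. S11 follows directly from the singular values of the centered latent matrix. A minimal sketch, using random stand-in data and a hypothetical helper name `explained_variance_ratio`:

```python
import numpy as np

def explained_variance_ratio(latents, n_components):
    """Fraction of total variance captured by the top principal components."""
    centered = latents - latents.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Variance along each principal direction is proportional to s**2.
    variances = singular_values ** 2
    return float(variances[:n_components].sum() / variances.sum())

rng = np.random.default_rng(0)
# Stand-in for latent pose encodings with decaying variance per dimension.
data = rng.standard_normal((1000, 112)) * np.linspace(2.0, 0.05, 112)
print(explained_variance_ratio(data, 80))
```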

A.2.3 High-resolution images of meshes

For the reader’s enjoyment, we include larger images of the human meshes from Figures 3-7. Meshes colored in blue are originals; grey and white meshes are model outputs.

Figure S12: Impact of the distortion loss term on reconstructed mesh quality. From left to right: original mesh, MeshVAE-D output, standard MeshVAE output.
Figure S13: Meshes from random shape vectors and fixed real pose encoding (rows 1-2) or vice versa (3-4).
Columns, left to right: diff_pose, diff_subject, target, combine, direct.
Figure S14: Transfer experiment. The model is given and and combines latent features to predict . The rightmost column shows, for comparison, the result of directly encoding and decoding itself. The third row shows an example from Faust, which does not have ground truth for pose swapping.
Figure S15: Interpolation experiment. Interpolating the latent pose encoding, compared to extrinsic linear interpolation in .
Figure S16: Top: Pose synchronization based on the latent pose encoding. Original sequences are shown in blue. Dynamically synchronized sequences shown in gray. Entire sequences shown, in miniature. See next figure for enlarged view.
Figure S17: Pose synchronization based on the latent pose encoding. Original sequences are shown in blue. Dynamically synchronized sequences shown in gray.