1 Introduction
The ability to generate new 3D shapes is a fundamentally important objective for many applications, especially in Virtual Reality, where the availability of large collections of varied 3D assets is necessary to create rich virtual environments. Much of the recent progress in this area has been facilitated by the introduction of new large-scale shape datasets such as ShapeNet (Chang et al. (2015)) and Dynamic FAUST (Bogo et al. (2017)), which have made viable the approaches based on data-driven deep learning techniques (see for example Wu et al. (2016); Achlioptas et al. (2018); Litany et al. (2018); Tan et al. (2018)). For many important applications it is not random shape generation that is desired but rather some user-controlled generation: the ability to manipulate an object by its parts (e.g. Borosan et al. (2012); Dubrovina et al. (2019)), or transfer pose characteristics across deformable shape instances (e.g. Sumner and Popović (2004); Gao et al. (2018)). These goals require a generative model that disentangles the underlying factors of variation in the data.
Learning disentangled representations is a well-studied problem in machine learning (Schmidhuber (1992); Bengio et al. (2013); Locatello et al. (2018); Achille and Soatto (2018); Kim and Mnih (2018); Hinton et al. (2011); Chen et al. (2016)). In the most general context, the explanatory or generative factors are a priori unknown, so the goal of disentanglement is to learn latent factors that are mutually independent and that capture maximal variation in the data. Unsupervised approaches, however, make it difficult to control the interpretation of the disentangled factors. Indeed, many natural modes of variation such as shape and color may be highly correlated in training data, even when they describe semantically independent features.
In this work, we introduce a new generative model for 3D shapes that explicitly disentangles the shape representation by its observable generative factors. Our model builds upon the generative Variational AutoEncoders (Kingma and Welling (2014)), which have shown promising results for learning rich representations of deformable 3D mesh categories, e.g. humans and animals (Litany et al. (2018); Tan et al. (2018)). The model is trained on combinations of synthetic and real datasets where the variations of interest can be controlled during mesh generation. This allows us to generate large-scale datasets with the necessary supervision (our model knows when training shapes share a latent factor). In addition to a dataset of articulated cylinders, we show results on a large-scale dataset of approximately 3M human shapes exhibiting extreme pose and shape variation (following Varol et al. (2017)). Although a parametric model is used to generate our training data, our model is agnostic to this (it only sees the 3D meshes) and can thus scale trivially to nonparametric shape datasets. Our evaluations include an analysis of the model’s latent disentanglement properties and experiments for several downstream applications: shape and pose transfer, temporal synchronization, and pose-independent shape matching.
Along with disentanglement, we improve the core performance of the basic Mesh VAE by incorporating a distortion-sensitive loss term that promotes more realistic shape generation, and an alternative technique for latent sampling that can overcome overparameterization of latent spaces (since the optimal latent dimensionality is typically unknown). One insight from our experiments is that a disentangled model can outperform a vanilla model with the same base architecture and generative capacity. This validates the hypothesis that disentangled models learn compact, robust representations.
One surprise in our results is that certain training modes lead to models that are disentangled from the generative standpoint, but not for inference – i.e., the latent representation itself is ‘entangled’, but the generator learns to disregard redundant or irrelevant latent information. Our primary model, however, is disentangled for both use cases.
2 Related Work
In contrast to the unsupervised disentangling models discussed above, for learning from visual data latent factors are often observable and in some way explicitly supervised. Model training may exploit temporal structure (e.g. in videos, Denton and Birodkar (2017); Villegas et al. (2017)), or generation of synthetic data with controlled latent factors (Kulkarni et al. (2015); Worrall et al. (2017); Yang et al. (2015)). Our approach most closely relates to the Inverse Graphics Network (Kulkarni et al. (2015)) which manipulates factors of variation within training minibatches. This approach requires knowing which generative factors are being varied, but does not require supervision of the explicit parametric transformations as in Worrall et al. (2017) and Yang et al. (2015).
A number of recent works explore learning disentangled generative models (Variational AutoEncoders, Kingma and Welling (2014), Generative Adversarial Networks, Goodfellow et al. (2014)) where the latent representation is decomposed into an observed (potentially interpretable) component and a component for the remaining variability (Kingma et al. (2014); Narayanaswamy et al. (2017); de Bem et al. (2018); Mathieu et al. (2016)). In these approaches, the interpretable latent factors (e.g. class label or human pose) typically require direct supervision with a regression or classification loss.
While there are many generative models for 3D data such as volumes, point clouds, and meshes (Wu et al. (2016); Achlioptas et al. (2018); Tan et al. (2018); Litany et al. (2018)), disentangled models, in particular generative models, are an underexplored area. Recently, Dubrovina et al. (2019) learn a part-aware factorized embedding space. Shapes can be generated by manipulating object parts, but the model generates volumetric shapes.
In addition to the works described above, it is important to note disentangled representations have been explored for numerous applications related to image data. Although a full review is out of scope here, to highlight different applications we refer the reader to topics on face images (Liu et al. (2018); Tran et al. (2017); Shu et al. (2017, 2018)), intrinsic image decomposition (Barrow and Tenenbaum (1978); Barron and Malik (2015)), and characteristic transfer across images (e.g. motion, Chan et al. (2018), appearance, Zanfir et al. (2018), and domain, Usman et al. (2019)).
Articulated shape models.
There is a significant body of work in representation learning for deformable articulated 3D shapes, notably of humans and animals. There are several parametric human shape models that capture the intrinsic human shape variation (Anguelov et al. (2005); Allen et al. (2006); Yang et al. (2014); Pishchulin et al. (2017)). Such approaches align a human mesh template to a set of 3D human scans, such as CAESAR (Robinette et al. (1999)), and compute the principal components on mesh vertex displacements or transformation matrices. To represent various human pose shapes, parametric skeleton skinning-based approaches and deformation-based approaches have been used. Skinning-based approaches such as SMPL (Loper et al. (2015)) and Allen et al. (2006) compute vertex positions from the body pose using learnt skinning weights. Deformation-based approaches such as SCAPE (Anguelov et al. (2005)), Freifeld and Black (2012), Hasler et al. (2009), and Hirshberg et al. (2012) use various representations of deformations to a reference mesh. More recently, rotation-invariant (Gao et al. (2016)) and as-consistent-as-possible (Gao et al. (2017)) deformation features have been used in mesh convolutional neural networks to extract a deformation embedding (Tan et al. (2018)) and perform unpaired shape deformation transfer using 3D shape CycleGAN (Gao et al. (2018)). In contrast, our work focuses on explicit shape and pose latent feature disentanglement for general articulated meshes. To capture a natural distribution of human poses, several 3D human animation datasets have been collected. SURREAL (Varol et al. (2017)) performs SMPL fits to CAESAR shapes and activities from CMU MoCap (1999) using Loper et al. (2014). Pons-Moll et al. (2015) and Bogo et al. (2017) provide direct scans from humans performing various activities. Finally, there is work on capturing 3D shapes of animals, including parametric deformable models (Cashman and Fitzgibbon (2012); Zuffi et al. (2018)), and part-based representations (Ntouskos et al. (2015)).
3 Generating 3D Shapes
3.1 Variational autoencoding
Variational autoencoders (VAEs) are a widely used framework for generative modeling. A VAE assumes that data $x$ is jointly distributed with certain latent variables $z$, which are typically given an independent Gaussian prior, $z \sim \mathcal{N}(0, I)$. To infer $z$ from $x$, we model the posterior distribution $p(z \mid x)$ by an encoder $q(z \mid x)$, which we take to be a neural network. Similarly, we model the likelihood $p(x \mid z)$ by a decoder network, which allows the model to be used generatively. Training a VAE consists of approximately minimizing the KL divergence of the estimated posterior $q(z \mid x)$ from the true posterior $p(z \mid x)$, by maximizing the so-called Evidence Lower Bound (ELBO). For more on VAE training, see Kingma and Welling (2014).
3.2 MeshVAE and the disentangled model
We based our model on the mesh variational autoencoder (MeshVAE) of Litany et al. (2018) (in principle our contributions could be incorporated into any similar Mesh VAE model, e.g. Tan et al. (2018)). The MeshVAE acts on input data consisting of per-vertex features on a mesh, i.e. an input is $X \in \mathbb{R}^{n \times d}$, where there are $n$ vertices and $d$ features (for us, $d = 3$, the vertex coordinates). The model outputs global latent parameters. The architecture relies crucially on the mesh topology and is entirely convolutional, except for a single initial (fully connected) decoding layer mapping the latent encoding to a set of per-vertex hidden features.
The architecture is as follows:

For VAE training, we sample a latent feature , consisting of a shape feature and a pose feature . At inference, we simply use .

The decoder generates per-vertex hidden features from one fully connected layer, then applies a sequence of FeaStNet convolutional layers (Verma et al. (2018)).
3.3 MeshVAED: Training for disentanglement
A baseline MeshVAE produces an ‘entangled’ latent encoding, which affords little or no control in shape generation. The goal for the disentangled model, MeshVAED, is for the latent features to capture shape and pose separately, and we took three steps to this end.
Batching. We structured the training set (SMPL) into doubly-supervised training batches, allowing us to train the model while fully supervising the desired factors of variation. We first structured the dataset into pairs of meshes, with each pair having either the same underlying body shape (i.e. subject identity) or the same pose (cf. the supervision techniques in Kulkarni et al. (2015); Worrall et al. (2017); Yang et al. (2015)). Each training batch then consisted of shape-constant or pose-constant pairs of meshes. For Faust shapes (Bogo et al. (2017)), pose labels are not available, so we only used shape-constant batches. Notably, despite only having access to partial supervision on Faust, the trained model successfully extracts pose and shape features from Faust meshes and is able to conduct pose transfer (see Fig. 7).
Clamping. For a pair of meshes from a training batch, the encoder produces a latent feature for each mesh. During training, for shape-constant pairs, we replaced the two latent shape vectors by their joint mean before passing them to the decoder. For pose-constant pairs, we instead clamped the latent pose vectors to their joint mean.
Latent variance loss. We added a loss term equal to the within-batch variance in the clamped latent feature: the variance of the latent shape vectors for shape-constant pairs and of the latent pose vectors for pose-constant pairs.
Our clamping approach is similar to that of Kulkarni et al. (2015), which not only averages the latent features but also stops gradients from passing through the clamped neurons. With the latter approach, since the pose encodes much more information than the body shape, it becomes necessary to train with a higher proportion (5-to-1) of shape-constant (i.e., pose-varying) batches. We found that stopping gradients had a mild negative impact on model performance, so our model does not do so.
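The clamping step and the latent variance loss described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the latent sizes (16 shape and 112 pose dimensions) and the exact normalization of the variance term are assumptions made for the example; only the batch size of 8 pairs comes from the text.

```python
import numpy as np

def clamp_pairs(z_a, z_b):
    """Replace each pair of latent vectors by their joint (pairwise) mean."""
    mean = 0.5 * (z_a + z_b)
    return mean, mean.copy()

def latent_variance_loss(z_a, z_b):
    """Variance of the supervised feature within each pair, averaged over the batch."""
    mean = 0.5 * (z_a + z_b)
    # Per-dimension population variance of the two samples in each pair.
    var = 0.5 * ((z_a - mean) ** 2 + (z_b - mean) ** 2)
    return float(var.sum(axis=1).mean())

def clamp_batch(shape_a, pose_a, shape_b, pose_b, constant):
    """Clamp the shared factor of a batch of pairs; return decoder inputs and the loss."""
    if constant == "shape":
        loss = latent_variance_loss(shape_a, shape_b)
        shape_a, shape_b = clamp_pairs(shape_a, shape_b)
    elif constant == "pose":
        loss = latent_variance_loss(pose_a, pose_b)
        pose_a, pose_b = clamp_pairs(pose_a, pose_b)
    else:
        raise ValueError("constant must be 'shape' or 'pose'")
    z_a = np.concatenate([shape_a, pose_a], axis=1)
    z_b = np.concatenate([shape_b, pose_b], axis=1)
    return z_a, z_b, loss

# Hypothetical latent sizes; a batch of 8 shape-constant pairs.
rng = np.random.default_rng(0)
shape_a, shape_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
pose_a, pose_b = rng.normal(size=(8, 112)), rng.normal(size=(8, 112))
z_a, z_b, loss = clamp_batch(shape_a, pose_a, shape_b, pose_b, constant="shape")
```

Because no gradient stopping is applied, in an autodiff framework both encodings in a pair would receive gradients through the shared mean, which matches the choice described in the text.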
3.4 Loss and regularizers
The VAE training loss consisted of two terms: a reconstruction error, plus the KL divergence loss term of the latent mean and variance from the Gaussian prior. For disentanglement, we included the latent variance loss defined above. As an additional regularization to improve surface smoothness, we introduced a geometric distortion loss based on Pauly et al. (2005), Eq. (3):
(1) 
where is the reconstruction displacement at . The distortion loss penalizes distortion between neighboring vertices, apart from a common translation relative to the base mesh. The resulting meshes are more realistic in terms of both surface texture and fine detail (see Fig. 5); moreover, generated meshes from this model retain smoothness even as the generated shape variation goes beyond the range of shapes seen during training (see Fig. 4
). In sum, the loss function was
and we used .
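As a rough illustration of the kind of term described above, the following sketch penalizes differences in reconstruction displacement between neighboring vertices. It is an assumed form chosen to match the verbal description (zero penalty for a common translation), not a reproduction of Eq. (1) or of the formula in Pauly et al. (2005); the function name and the toy tetrahedron are hypothetical.

```python
import numpy as np

def distortion_loss(pred, target, edges):
    """Mean squared difference of reconstruction displacements across mesh edges."""
    disp = pred - target                          # (n, 3) displacement at each vertex
    diff = disp[edges[:, 0]] - disp[edges[:, 1]]  # displacement mismatch per edge
    return float((diff ** 2).sum(axis=1).mean())

# Toy mesh: a single tetrahedron with all six edges.
target = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
edges = np.array([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]])

# A rigid translation moves every vertex identically, so it is not penalized.
translated = target + np.array([0.3, -0.2, 0.1])

# Independent per-vertex noise distorts neighboring vertices and is penalized.
rng = np.random.default_rng(0)
noisy = target + 0.05 * rng.normal(size=target.shape)

print(distortion_loss(translated, target, edges))
print(distortion_loss(noisy, target, edges))
```

A term of this shape is complementary to a per-vertex reconstruction error: the latter measures how far each vertex moved, while the former measures how inconsistently adjacent vertices moved.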
3.5 Models
We compared our model to the following baselines: (1) an unmodified Mesh VAE, (2) a model trained directly to do pose transfer between meshes, and (3) a model based on latent feature permutation during training (MeshVAEP).
3.6 Transfer model baseline
We trained the Transfer model directly on a pose transfer task. We constructed a dataset of triples taken from SMPL, where the second and third meshes have, respectively, the same pose (but a different subject) and the same subject (but a different pose) as the target. The model is shown and asked to predict . We used an architecture similar to the MeshVAE, with two encoders, one for shape and one for pose, with respectively and the hidden layer widths compared to the full model. The dimensions of the latent space and decoder were left unchanged, and we did not clamp the latent vectors.
3.7 Permute model baseline (MeshVAEP)
We trained the Permute model without clamping and variance loss, instead permuting latent features of batch pairs during training. That is, we swap the latent shape features of the two meshes in a shape-constant batch (or their latent pose features in a pose-constant batch) before passing them to the decoder. By construction, the exchanged latent features still describe the same true meshes, so the decoder learns to reconstruct the same output mesh. MeshVAEP produces decoded meshes of similar quality to MeshVAED, but the latent features themselves are poorly disentangled: the shape vector ends up carrying pose information (in fact more pose information than shape information, see 3), which the decoder learns to ignore (see 2). This baseline highlights a key distinction between generative disentanglement (possible using MeshVAEP or MeshVAED) and inferential disentanglement (only possible with MeshVAED). Indeed, the MeshVAEP model performs closer to the baseline on an inference task related to shape.
4 Experiments
4.1 Articulated cylinders
We first trained our model on a toy dataset consisting of meshes shaped as cylinders with a single bend of angle (1 pose parameter) and varying arm lengths and radius (3 shape parameters), see Fig. 1. For train/test splits, we held out a range of values for each parameter (see appendix).
Parameter  

0.9628  0.0146  0.0065  0.0215  
0.0192  0.9992      
0.0047    0.9953    
0.0034      0.9801 
The resulting models successfully disentangle the cylinder shape from the pose angle. We performed pose and shape exchanges by swapping latent features between cylinders with different shape and pose parameters, then recovered the latent parameters using a least squares fit of cylinders to corresponding vertex positions of the decoded shapes. We computed Pearson’s correlation between the ground truth and estimated parameters, finding successful semantic disentanglement, with strong correlations between latent features and weak correlation across latent features.
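The correlation analysis described above can be sketched as follows. The example assumes the ground-truth and recovered cylinder parameters are stored as arrays with one column per parameter; the data here is synthetic and the function name is hypothetical, so the numbers it prints are unrelated to the values in the table.

```python
import numpy as np

def cross_correlation(true_params, est_params):
    """Pearson correlation between every ground-truth and every estimated parameter."""
    k = true_params.shape[1]
    # With rowvar=False, columns are variables; the result is a (2k, 2k) matrix.
    full = np.corrcoef(true_params, est_params, rowvar=False)
    return full[:k, k:]  # rows: ground truth, columns: estimates

# Synthetic stand-in: 4 parameters (1 pose angle + 3 shape parameters),
# with estimates equal to the ground truth plus small fitting noise.
rng = np.random.default_rng(0)
true_params = rng.uniform(size=(500, 4))
est_params = true_params + 0.05 * rng.normal(size=(500, 4))

corr = cross_correlation(true_params, est_params)
print(np.round(corr, 3))
```

Large diagonal entries with small off-diagonal entries are the pattern one would expect when each recovered parameter tracks its own ground-truth factor and little else.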
4.2 Human shapes
Next, we trained a model on a combined dataset of human shapes. We combined shapes from the parametric mesh dataset SMPL and shapes from Faust, which consists of motion-captured meshes, labeled by subject identity but not pose. Training batches consisted of 8 pairs of meshes, alternating between SMPL and Faust in a ratio of 5:5:1 (SMPL shape-constant batches : SMPL pose-constant batches : Faust batches). Note that Faust batches are always shape-constant because Faust does not have pose labels.
For train/test splits, we held out two subjects and one activity from Faust; for SMPL, we held out 100 pose sequences and all subjects whose leading four shape parameters fell within distance of the points ; the SMPL shape distribution overall was sampled uniformly from to in each parameter. This is a much broader distribution than used by SURREAL (Varol et al. (2017)) which samples from the unit Gaussian. This was necessary to generate a dataset with extreme variations in human shape. We will provide all the sampled shape parameters and details so the test and train datasets can be reproduced exactly.
For direct reconstruction of input meshes, the model shows improved vertex error relative to a baseline Mesh VAE (Fig. 2) and compared to a model trained directly on pose transfer.
Model  MVE (cm) 

MeshVAE  3.6 
MeshVAED (Ours)  2.7 
MeshVAEP (Permute)  2.8 
Transfer model  3.7 
Model  MVE (cm) 

MeshVAE  n/a 
MeshVAED (Ours)  
MeshVAEP (Permute)  
Transfer model 
4.3 Encoder disentanglement
We first assessed the quality of the latent encoding itself. We stored the latent shape and pose encodings for all meshes, then calculated the distance between latent pose encodings for: (a) pairs of meshes in the same pose, (b) pairs of meshes of the same shape, and (c) random pairs of meshes.
In a perfectly disentangled model, the distances (a) should be zero, while (b) and (c) should have similar distributions of latent distances. In practice, we instead observe for MeshVAED. We then repeated the calculation with latent shape encodings, where we expect the reverse, i.e. . The distribution of distances is shown in the histograms in Fig. 3, showing good disentanglement for our clamped model MeshVAED. By contrast, MeshVAEP, trained by permuting rather than clamping the latent features, is poorly disentangled: the latent shape encoding is more responsive to pose than to shape! In particular, latent shape proximity in this model is more indicative of pose alignment () than shape similarity.
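The distance comparison above can be sketched as below, assuming the latent pose encodings are stored in an array alongside integer pose and subject labels. All names are hypothetical and the encodings are synthetic (constructed to depend on pose only), so the example illustrates the ideal pattern rather than the measured one.

```python
import numpy as np

def pair_distances(codes, pairs):
    """Euclidean distance between latent codes for each index pair."""
    return np.linalg.norm(codes[pairs[:, 0]] - codes[pairs[:, 1]], axis=1)

def group_pairs(labels, rng, n_pairs):
    """Sample index pairs (i, j), i != j, that share the same label."""
    pairs = []
    while len(pairs) < n_pairs:
        i = rng.integers(len(labels))
        j = rng.choice(np.flatnonzero(labels == labels[i]))
        if i != j:
            pairs.append((i, j))
    return np.array(pairs)

# Synthetic data: 400 meshes, 20 poses, 10 subjects.
rng = np.random.default_rng(0)
n = 400
pose_id = np.arange(n) % 20
subject_id = np.arange(n) // 40
centers = rng.normal(size=(20, 8))
pose_codes = centers[pose_id] + 0.05 * rng.normal(size=(n, 8))

same_pose = pair_distances(pose_codes, group_pairs(pose_id, rng, 200))        # (a)
same_shape = pair_distances(pose_codes, group_pairs(subject_id, rng, 200))    # (b)
random_pairs = pair_distances(pose_codes, rng.integers(n, size=(200, 2)))     # (c)
print(same_pose.mean(), same_shape.mean(), random_pairs.mean())
```

Repeating the same computation with the latent shape encodings in place of `pose_codes` gives the complementary check, where same-shape pairs should be the close ones.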
4.4 Decoder disentanglement
To assess disentanglement on the level of the decoder, we attempted to generate shapes while holding one or the other feature fixed.
For fixed-pose, variable-shape generation, we took a pose encoding from a fixed real mesh and generated random shape encodings scaled by the observed latent scale across the dataset. For random pose generation, since our model overparametrizes the true pose, the latent pose distribution does not, in practice, occupy the entire latent (pose) space, making it a challenge to generate suitable random poses. To get around this problem, we examined the model’s latent pose distribution using a modified PCA, computed at inference from the pose encodings of 120k random training meshes. The top 80 principal components account for  of the latent pose distribution variance. We then generated Gaussian random pose vectors within these principal axes (weighted by the singular values) and combined them with a fixed shape encoding from a fixed mesh from the test set. See Fig. 4. A more detailed analysis of performance relative to latent dimensions is in the appendix.
Note that the level of variation (particularly in body shape) in the generated meshes goes beyond that of the training set. We view these results as compound benefits of having both a disentangled and geometrically-based model: we are able to vary the mesh in a controlled way, while our geometric priors, such as the distortion term (1), ensure that variation in vertex predictions is smoothed out locally to form plausible mesh deformations, rather than just degrading the mesh (see Fig. 5 right).
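The random pose generation step can be illustrated with an ordinary PCA. The text refers to a "modified PCA" whose details are not given here, so the sketch below uses plain SVD-based PCA as a stand-in; the function names, the latent size of 32 and the 5-dimensional subspace are assumptions chosen to keep the example small.

```python
import numpy as np

def fit_pca(codes, k):
    """Return the mean, top-k principal axes, per-axis std dev, and explained variance ratio."""
    mean = codes.mean(axis=0)
    _, s, vt = np.linalg.svd(codes - mean, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    scale = s[:k] / np.sqrt(len(codes) - 1)  # singular values rescaled to std devs
    return mean, vt[:k], scale, explained

def sample_codes(mean, axes, scale, n, rng):
    """Draw Gaussian coefficients along the principal axes and map back to latent space."""
    coeffs = rng.normal(size=(n, len(scale))) * scale
    return mean + coeffs @ axes

# Synthetic pose encodings occupying a 5-dimensional subspace of a 32-dimensional latent space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(5, 32))
pose_codes = rng.normal(size=(1000, 5)) @ basis + 0.01 * rng.normal(size=(1000, 32))

mean, axes, scale, explained = fit_pca(pose_codes, k=5)
random_poses = sample_codes(mean, axes, scale, n=10, rng=rng)
print(explained, random_poses.shape)
```

Sampling inside the principal subspace keeps generated encodings in the region the decoder has actually seen, which is the motivation given above for not sampling the full latent pose space.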


4.5 Pose and shape transfer
A primary application of a disentangled model is to transfer poses and body shapes from one mesh to another. We used the dataset of triples constructed for the pose transfer baseline (see Sec. 4.2). While the baseline model is trained directly on this task, the disentangled model instead produces by combining the appropriate latent attributes from and . See Fig. 7.
To ensure that the task presented a challenge for the model, we tested on a subset of triples, requiring the secondary meshes to have extrinsic mean vertex distance from of at least cm for and cm for . Surprisingly, the primary model outperformed the model trained directly on triples (Fig. 2).
4.6 Pose synchronization on Faust
Next, we evaluated the model on a pose synchronization task using dynamic time warping (DTW, Sakoe and Chiba (1978)). Given two sequences and costs (energies) DTW produces a sequence of pairs such that , , and every is matched to at least one and vice versa, minimizing the total energy. We performed the synchronization task using cost given by distance between latent pose encodings for and . This task is especially interesting on Faust because the sequences consist of similar motions (jumping jacks, running on the spot, and so on) but are not synchronized and do not have pose labels. See Fig. 6.
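For reference, a textbook dynamic-programming version of DTW is sketched below, with the cost taken as the Euclidean distance between latent pose encodings of the two sequences. The toy one-dimensional "encodings" and the function name are made up for illustration; this is not the authors' code, and any band constraints or step weights used in their experiments are not modeled.

```python
import numpy as np

def dtw(cost):
    """Return (total_energy, path) for an (m, n) matrix of pairwise costs."""
    m, n = cost.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last pair to the first to recover the alignment.
    i, j = m, n
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return float(acc[m, n]), path[::-1]

# Toy latent pose encodings: sequence b is a time-warped copy of sequence a.
a = np.array([0., 1., 2., 3., 4.]).reshape(-1, 1)
b = np.array([0., 0., 1., 2., 2., 3., 4.]).reshape(-1, 1)
cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

energy, path = dtw(cost)
print(energy, path)
```

The returned path starts at the first frames, ends at the last frames, advances monotonically in both sequences, and matches every frame of each sequence at least once.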
4.7 Shape recognition on SMPL
As a complementary experiment to pose synchronization, we performed a shape recognition task: we selected subjects at random from the test set, and for each subject chose two random meshes. We then had the model predict, for each mesh in the first set, which of the counterparts in the second set comes from the same subject. We used nearest-neighbor assignment based on the shape encoding only (MeshVAED, MeshVAEP) or the overall latent encoding (MeshVAED). Our model significantly outperforms the baseline on this task (Fig. 5), correctly identifying nearly half the meshes (top-1 score) and 2/3 (top-3 score). We also observe a moderate improvement over MeshVAEP, for which the latent representation is only partially disentangled.
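The nearest-neighbor evaluation can be sketched as follows, assuming two arrays of shape encodings in which row i of each array comes from subject i. The data is synthetic and far easier than the real task, and the function name and sizes are hypothetical, so the scores are not comparable to those reported above.

```python
import numpy as np

def top_k_accuracy(query, gallery, k):
    """Fraction of queries whose true counterpart (same row index) is among the k nearest gallery codes."""
    dists = np.linalg.norm(query[:, None, :] - gallery[None, :, :], axis=-1)
    ranked = np.argsort(dists, axis=1)[:, :k]
    hits = (ranked == np.arange(len(query))[:, None]).any(axis=1)
    return float(hits.mean())

# Synthetic shape encodings: 50 subjects, two noisy encodings per subject.
rng = np.random.default_rng(0)
subjects = rng.normal(size=(50, 16))
query = subjects + 0.1 * rng.normal(size=subjects.shape)
gallery = subjects + 0.1 * rng.normal(size=subjects.shape)

top1 = top_k_accuracy(query, gallery, k=1)
top3 = top_k_accuracy(query, gallery, k=3)
print(top1, top3)
```

Using only the shape part of the encoding for the distance is what makes this a test of inferential disentanglement: any pose information leaking into the shape vector pushes meshes of the same subject apart.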
4.8 Latent pose interpolation
To explore the local structure of the model’s latent space, we took motion sequences and conducted pose interpolation between nearby frames, using linear interpolation in the latent space. The interpolation results in additional semantic detail, such as arm-bending between two poses with almost straight arms. See Fig. 7 for a comparison with naive extrinsic linear interpolation of vertex coordinates in $\mathbb{R}^3$, which causes unrealistic mesh deformations.
[Figure panels: target, swap, direct]
5 Conclusion and Future Work
This paper introduces a disentangled mesh-convolutional VAE. With careful consideration of the supervision and training design, we see that our proposed model can achieve accurate disentanglement while capturing the varied pose and shape properties in large-scale mesh datasets.
Given these promising results, in future work we will explore two directions: (1) model improvements, extending our current design by incorporating techniques from alternate generative approaches (e.g. VAE-GAN), and (2) domain transfer to 3D data captured in the wild (e.g. captured with commodity RGB-D sensors).
References
 Achille and Soatto [2018] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018. URL http://jmlr.org/papers/v19/17646.html.
 Achlioptas et al. [2018] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3D point clouds. In International Conference on Machine Learning, (ICML), volume 80, pages 40–49, 2018.
 Allen et al. [2006] B. Allen, B. Curless, Z. Popović, and A. Hertzmann. Learning a correlated model of identity and posedependent body shape variation for realtime synthesis. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’06, pages 147–156, AirelaVille, Switzerland, Switzerland, 2006. Eurographics Association. ISBN 3905673347. URL http://dl.acm.org/citation.cfm?id=1218064.1218084.
 Anguelov et al. [2005] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: Shape completion and animation of people. In ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, pages 408–416, New York, NY, USA, 2005. ACM. doi: 10.1145/1186822.1073207. URL http://doi.acm.org/10.1145/1186822.1073207.
 Barron and Malik [2015] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell., 37(8):1670–1687, 2015.
 Barrow and Tenenbaum [1978] H. G. Barrow and J. M. Tenenbaum. Recovering intrinsic scene characteristics from images. In Computer Vision Systems, 1978.
 Bengio et al. [2013] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.

 Bogo et al. [2017] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic FAUST: Registering human bodies in motion. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), July 2017.
 Borosan et al. [2012] P. Borosan, M. Jin, D. DeCarlo, Y. Gingold, and A. Nealen. RigMesh: Automatic rigging for part-based shape modeling and deformation. ACM Transactions on Graphics (TOG), 31(6):198:1–198:9, 2012.
 Cashman and Fitzgibbon [2012] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3d morphable models from 2d images. IEEE transactions on pattern analysis and machine intelligence, 35(1):232–244, 2012.
 Chan et al. [2018] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. CoRR, abs/1808.07371, 2018. URL http://arxiv.org/abs/1808.07371.
 Chang et al. [2015] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An informationrich 3d model repository. CoRR, abs/1512.03012, 2015.
 Chen et al. [2016] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems, pages 2180–2188, 2016.
 CMU MoCap [1999] CMU MoCap. Carnegiemellon mocap database, 1999. URL http://mocap.cs.cmu.edu/.
 de Bem et al. [2018] R. de Bem, A. Ghosh, T. Ajanthan, O. Miksik, N. Siddharth, and P. H. Torr. Dgpose: Disentangled semisupervised deep generative models for human body analysis. arXiv preprint arXiv:1804.06364, 2018.
 Denton and Birodkar [2017] E. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.
 Dubrovina et al. [2019] A. Dubrovina, F. Xia, P. Achlioptas, M. Shalah, and L. J. Guibas. Composite shape modeling via latent space factorization. CoRR, abs/1901.02968, 2019. URL http://arxiv.org/abs/1901.02968.
 Freifeld and Black [2012] O. Freifeld and M. J. Black. Lie bodies: A manifold representation of 3d human shape. In European Conference on Computer Vision, pages 1–14. Springer, 2012.
 Gao et al. [2016] L. Gao, Y.K. Lai, D. Liang, S.Y. Chen, and S. Xia. Efficient and flexible deformation representation for datadriven surface modeling. ACM Transactions on Graphics (TOG), 35(5):158, 2016.
 Gao et al. [2017] L. Gao, Y.K. Lai, J. Yang, L.X. Zhang, L. Kobbelt, and S. Xia. Sparse data driven mesh deformation. arXiv preprint arXiv:1709.01250, 2017.
 Gao et al. [2018] L. Gao, J. Yang, Y.L. Qiao, Y.K. Lai, P. L. Rosin, W. Xu, and S. Xia. Automatic unpaired shape deformation transfer. ACM Trans. Graph., 37(6):237:1–237:15, Dec. 2018. ISSN 07300301. doi: 10.1145/3272127.3275028. URL http://doi.acm.org/10.1145/3272127.3275028.
 Goodfellow et al. [2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Hasler et al. [2009] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.P. Seidel. A statistical model of human pose and body shape. In Computer graphics forum, volume 28, pages 337–346. Wiley Online Library, 2009.
 Hinton et al. [2011] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming autoencoders. In Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN), pages 44–51, 2011.
 Hirshberg et al. [2012] D. A. Hirshberg, M. Loper, E. Rachlin, and M. J. Black. Coregistration: Simultaneous alignment and modeling of articulated 3d shape. In European Conference on Computer Vision, pages 242–255. Springer, 2012.
 Kim and Mnih [2018] H. Kim and A. Mnih. Disentangling by factorising. In International Conference on Machine Learning, pages 2649–2658, 2018.
 Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
 Kingma et al. [2014] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 Kulkarni et al. [2015] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
 Litany et al. [2018] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1886–1895, 2018.
 Liu et al. [2018] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In CVPR, 2018.
 Locatello et al. [2018] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. CoRR, abs/1811.12359, 2018.
 Loper et al. [2015] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
 Loper et al. [2014] M. M. Loper, N. Mahmood, and M. J. Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 33(6):220:1–220:13, Nov. 2014. URL http://doi.acm.org/10.1145/2661229.2661273.
 Mathieu et al. [2016] M. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5041–5049, 2016.
 Narayanaswamy et al. [2017] S. Narayanaswamy, T. B. Paige, J.W. van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr. Learning disentangled representations with semisupervised deep generative models. In Advances in Neural Information Processing Systems, pages 5925–5935, 2017.
 Ntouskos et al. [2015] V. Ntouskos, M. Sanzari, B. Cafaro, F. Nardi, F. Natola, F. Pirri, and M. Ruiz. Componentwise modeling of articulated objects. In Proceedings of the IEEE International Conference on Computer Vision, pages 2327–2335, 2015.
 Pauly et al. [2005] M. Pauly, N. J. Mitra, J. Giesen, M. Gross, and L. J. Guibas. Examplebased 3d scan completion. In Proceedings of the Third Eurographics Symposium on Geometry Processing, SGP ’05, AirelaVille, Switzerland, Switzerland, 2005. Eurographics Association. ISBN 390567324X. URL http://dl.acm.org/citation.cfm?id=1281920.1281925.
 Pishchulin et al. [2017] L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele. Building statistical shape spaces for 3d human modeling. Pattern Recognition, 2017.
 Pons-Moll et al. [2015] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics, (Proc. SIGGRAPH), 34(4):120:1–120:14, Aug. 2015.
 Robinette et al. [1999] K. Robinette, H. Daanen, and E. Paquet. The caesar project: a 3d surface anthropometry survey. In 3D Imaging and Modelling, pages 380 – 386, 1999.
 Sakoe and Chiba [1978] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 26:43–49, 1978.
 Schmidhuber [1992] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4:863–879, 1992.
 Shu et al. [2017] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 Shu et al. [2018] Z. Shu, M. Sahasrabudhe, R. A. Güler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming Autoencoders: Unsupervised Disentangling of Shape and Appearance. In The European Conference on Computer Vision, (ECCV), 2018.
 Sumner and Popović [2004] R. W. Sumner and J. Popović. Deformation transfer for triangle meshes. ACM Trans. Graph., 23(3):399–405, 2004.
 Tan et al. [2018] Q. Tan, L. Gao, Y. Lai, and S. Xia. Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2018.
 Tran et al. [2017] L. Tran, X. Yin, and X. Liu. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of IEEE Computer Vision and Pattern Recognition, Honolulu, HI, July 2017.
 Usman et al. [2019] B. Usman, N. Dufour, K. Saenko, and C. Bregler. Puppetgan: Transferring disentangled properties from synthetic to real images. CoRR, abs/1901.10024, 2019. URL http://arxiv.org/abs/1901.10024.
 Varol et al. [2017] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.
 Verma et al. [2018] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2598–2606, 2018.
 Villegas et al. [2017] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In International Conference on Learning Representations, 2017.
 Worrall et al. [2017] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Interpretable transformations with encoder-decoder networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
 Wu et al. [2016] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 Yang et al. [2015] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
 Yang et al. [2014] Y. Yang, Y. Yu, Y. Zhou, S. Du, J. Davis, and R. Yang. Semantic parametric reshaping of human body models. In 2014 2nd International Conference on 3D Vision, volume 2, pages 41–48. IEEE, 2014.
 Zanfir et al. [2018] M. Zanfir, A.-I. Popa, A. Zanfir, and C. Sminchisescu. Human appearance transfer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 Zuffi et al. [2018] S. Zuffi, A. Kanazawa, and M. J. Black. Lions and tigers and bears: Capturing non-rigid, 3D, articulated shape from images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2018.
Appendix A Appendix
A.1 Network architecture
Our network architecture used the hidden layer widths indicated in Fig. S8. All models were trained in TensorFlow with an Adam optimizer, with the learning rate decaying exponentially over 400k steps. The articulated cylinder models typically converged within 100k steps.

| Cylinders | Layers / Units |
|---|---|
| Encoder | 16 (1x1 conv), 24, 32, 48, 64 (FeaStNet), mean-pooling |
| Latent space | 4N = 3N (shape) + 1N (pose), N = 1, 2, 3 |
| Decoder | (FC), 64, 48, 32, 24, 16 (FeaStNet), 3 (1x1 conv) |

| Human shapes | Layers / Units |
|---|---|
| Encoder | 16 (1x1 conv), 32, 64, 96, 128 (FeaStNet), mean-pooling |
| Latent space | 128 = 16 (shape) + 112 (pose) |
| Decoder | (FC), 128, 96, 64, 32, 16 (FeaStNet), 3 (1x1 conv) |
| Discriminator | 16 (1x1 conv), 32, 64 (FeaStNet), mean-pooling, then 1 (FC) |
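As an illustration of the latent layout listed for the human-shape model (128 = 16 shape + 112 pose), the following minimal sketch splits a latent vector into its two parts and recombines parts from two different meshes. The function names and the ordering of the parts (shape features first) are assumptions made for this example, not details given in the paper.

```python
import numpy as np

# Widths taken from the human-shape latent space row; the ordering
# (shape first, then pose) is an assumption for this sketch.
SHAPE_DIM, POSE_DIM = 16, 112

def split_latent(z):
    """Split latent vectors of width 128 into (shape, pose) parts."""
    z = np.asarray(z)
    assert z.shape[-1] == SHAPE_DIM + POSE_DIM
    return z[..., :SHAPE_DIM], z[..., SHAPE_DIM:]

def swap_shape_pose(z_a, z_b):
    """Combine the shape part of z_a with the pose part of z_b."""
    shape_a, _ = split_latent(z_a)
    _, pose_b = split_latent(z_b)
    return np.concatenate([shape_a, pose_b], axis=-1)
```

Decoding the output of `swap_shape_pose` would correspond to a shape-pose swap of the kind evaluated in the experiments, under the layout assumed here.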
A.2 Experiments
A.2.1 Articulated cylinders
We evaluated disentanglement on the articulated cylinders by estimating the explicit shape and pose parameters from the reconstructed meshes (see Section 4.1 of the paper). The test set used held-out ranges of the parameters. For this held-out set of meshes, we compared latent RMSE (see Fig. S9) and Pearson’s correlation (see Fig. S10) for direct reconstructions versus random shape-pose swaps. The error in the shape latent was almost identical in both cases, whereas the pose latent exhibited a larger error after the swap.
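Fig. S9: RMSE for latent swap vs. direct reconstruction.

| Parameter | Latent swap | Direct |
|---|---|---|
|  | 10.25375 | 1.5044 |
|  | 0.04784 | 0.04711 |
|  | 0.06322 | 0.0382 |
|  | 0.00545 | 0.0051 |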
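Fig. S10: Pearson’s correlation.

| Parameter |  |  |  |  |
|---|---|---|---|---|
|  | 0.99961 | 0.01074 | 0.01690 | 0.00540 |
|  | 0.00085 | 0.99973 |  |  |
|  | 0.00705 |  | 0.99919 |  |
|  | 0.02340 |  |  | 0.98127 |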
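The two metrics reported above can be computed as in the following minimal sketch. It assumes the ground-truth and estimated parameters are available as arrays of shape (num_meshes, num_params); the function names and the arrangement of the correlation matrix (true parameters along rows, estimated parameters along columns) are choices made for this example rather than the paper's stated procedure.

```python
import numpy as np

def rmse(true_params, est_params):
    """Per-parameter RMSE; inputs have shape (num_meshes, num_params)."""
    diff = np.asarray(est_params, dtype=float) - np.asarray(true_params, dtype=float)
    return np.sqrt(np.mean(diff ** 2, axis=0))

def cross_correlation(true_params, est_params):
    """Pearson correlation between each true and each estimated parameter.

    Returns a (num_params, num_params) matrix whose entry (i, j) correlates
    true parameter i with estimated parameter j.
    """
    t = np.asarray(true_params, dtype=float)
    e = np.asarray(est_params, dtype=float)
    k = t.shape[1]
    # Columns are variables; the joint matrix has shape (2k, 2k).
    full = np.corrcoef(t, e, rowvar=False)
    return full[:k, k:]
```

Calling both functions once on direct reconstructions and once on swapped reconstructions would give the two columns of an RMSE comparison and a correlation matrix for each setting.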
A.2.2 Discussion on PCA reduction for latent pose generation for human shapes
Our model over-parametrizes the latent pose vector (see Fig. S11), offering the possibility of reducing the latent dimensionality after training has finished. For random pose generation, we found it beneficial to first reduce the latent pose space to 80 latent dimensions by conducting a principal component analysis of the empirical latent distribution and sampling latent vectors from the top principal components. The PCA was based on the latent pose encodings of approximately 120k meshes from the training set; we found similar results using PCA computed from 40k and 220k meshes.
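The reduction described above can be sketched as follows. This is a minimal example under stated assumptions: the latent pose encodings are given as an array of shape (num_meshes, 112), and new vectors are drawn from an axis-aligned Gaussian in the PCA subspace. The Gaussian sampling rule and all function names are assumptions for illustration; the text only states that vectors are sampled from the top principal components.

```python
import numpy as np

def fit_latent_pca(codes, n_components=80):
    """PCA of empirical latent codes with shape (num_meshes, latent_dim)."""
    codes = np.asarray(codes, dtype=float)
    mean = codes.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(codes - mean, full_matrices=False)
    # Standard deviation of the data along each retained direction.
    std = s[:n_components] / np.sqrt(len(codes) - 1)
    return mean, vt[:n_components], std

def sample_latent_pose(mean, components, std, n_samples, rng):
    """Draw latent pose vectors from a Gaussian fitted in the PCA subspace."""
    coeffs = rng.normal(size=(n_samples, len(std))) * std
    return mean + coeffs @ components
```

The sampled vectors have the full latent width (112 here) but vary only within the 80-dimensional principal subspace, so they can be passed to an unchanged decoder.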
A natural question is whether this dimensionality reduction is equivalent to training with fewer latent features. We compared MeshVAED with 112 latent pose features (MeshVAED112) to a smaller MeshVAED with 80 latent pose features (MeshVAED80), and found that the smaller model produced significantly worse reconstructed meshes. Specifically, MeshVAED80 had a higher mean vertex error (MVE) for reconstructed meshes than MeshVAED112, and a larger latent embedding variance loss (between meshes with common shape or pose). In contrast, applying the latent PCA projection to MeshVAED112 had a negligible impact on mesh reconstruction, increasing the MVE only marginally relative to MeshVAED112 without the projection. The PCA projection on MeshVAED112 also changed the latent pose vectors by a fraction of their norm. Thus, training a model with fewer latent features results in larger reconstruction and disentanglement errors than applying PCA reduction to the latent features of a larger model.
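The two quantities used in this comparison can be computed as in the sketch below. It assumes MVE means the average Euclidean distance between corresponding vertices, and that the change in a latent vector is measured as the norm of the difference divided by the norm of the original; both definitions and the function names are assumptions for this example.

```python
import numpy as np

def mean_vertex_error(pred_vertices, true_vertices):
    """Mean Euclidean distance between corresponding vertices.

    Inputs have shape (num_meshes, num_vertices, 3), in the same units.
    """
    diff = np.asarray(pred_vertices, dtype=float) - np.asarray(true_vertices, dtype=float)
    return float(np.linalg.norm(diff, axis=-1).mean())

def relative_change(original, projected):
    """Mean of ||projected - original|| / ||original|| over latent vectors."""
    original = np.asarray(original, dtype=float)
    projected = np.asarray(projected, dtype=float)
    num = np.linalg.norm(projected - original, axis=-1)
    den = np.linalg.norm(original, axis=-1)
    return float(np.mean(num / den))
```

Evaluating `mean_vertex_error` on decoder outputs with and without the PCA projection, and `relative_change` on the latent pose vectors before and after projection, would reproduce the kind of comparison described in the paragraph above.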
A.2.3 High-res images of meshes
For the reader’s enjoyment, we include larger images of the human meshes from Figures 3–7. Meshes colored in blue are originals; grey and white meshes are model outputs.
[Figures: enlarged human mesh images. Panel labels: diff_pose, diff_subject, target, combine, direct; Model, Data, Extrinsic.]